AD-A173  022 


Productivity Engineering in the UNIX† Environment


The  Design  and  Evaluation  of  a  High  Performance  Smalltalk  System 


Technical  Report 


S.  L.  Graham 
Principal  Investigator 

(415)  642-2059 



“The  views  and  conclusions  contained  in  this  document  are  those  of  the  authors  and 
should  not  be  interpreted  as  representing  the  official  policies,  either  expressed  or  implied, 
of  the  Defense  Advanced  Research  Projects  Agency  or  the  U.S.  Government.’’ 




Contract  No.  N00039-84-C-008Q 
August 7, 1984 - August 6, 1987


Arpa  Order  No.  4871 


†UNIX is a trademark of AT&T Bell Laboratories


CLEARED
FOR OPEN PUBLICATION
SEP 23 1986
DIRECTORATE FOR FREEDOM OF INFORMATION
AND SECURITY REVIEW (OASD-PA)
DEPARTMENT OF DEFENSE

This document has been approved
for public release and sale; its
distribution is unlimited.

86 4034




The  Design  and  Evaluation  of 
A  High  Performance 
Smalltalk  System 


David  Michael  Ungar 
February,  1986 


Abstract 


The  Smalltalk-80 system  makes  it  possible  to  write  programs  quickly  by  providing 
object-oriented programming, incremental compilation, run-time type checking, user-extensible data
types  and  control  structures,  and  an  interactive  graphical  interface.  However,  the  potential  savings 
in  programming  effort  have  been  curtailed  by  poor  performance  in  widely  available  computers  or 
high  processor  cost.  Smalltalk-80  systems  pose  tough  challenges  for  implementors:  dynamic  data 
typing,  a  high-level  instruction  set,  frequent  and  expensive  procedure  calls,  and  object-oriented 
storage  management. 

The dissertation documents two results that run counter to conventional wisdom: that a
reduced instruction set computer can offer excellent performance for a system with dynamic data
typing such as Smalltalk-80, and that automatic storage reclamation need not be time-consuming.


Abstract 

The Smalltalk-80™ system makes it possible to write programs quickly by providing
object-oriented programming, incremental compilation, run-time type checking,
user-extensible data types and control structures, and an interactive graphical interface.
However, the potential savings in programming effort have been curtailed by poor
performance in widely available computers or high processor cost. Smalltalk-80 systems pose
tough challenges for implementors: dynamic data typing, a high-level instruction set,
frequent and expensive procedure calls, and object-oriented storage management.

To solve these problems, a group of researchers at U. C. Berkeley has designed and
built the SOAR (Smalltalk On A RISC) microprocessor. In order to determine the
performance of Smalltalk-80 on SOAR and to evaluate the importance of each of the ideas,
simulations of five representative benchmarks have been analyzed. The results suggest that:

• Six ideas substantially improve performance: compilation to a low-level instruction
set, multiple windows of on-chip registers, caching the target of a call instruction in the
instruction itself, byte insert and extract instructions, instructions for arithmetic and
comparison operations on tagged integers, and our storage management algorithm,
Generation Scavenging.



• Seven features contribute little to performance: shadow registers to simplify trap
recovery, hardware assistance for garbage collection, vectored traps, addressable
registers, clearing multiple registers in parallel, conditional trap instructions, and load- and
store-multiple instructions.

•  The  language-specific  hardware  in  SOAR  doubles  its  performance  over  a  RISC  II  with 
the  same  cycle  time. 

• Generation Scavenging, a storage reclamation algorithm developed by the author,
consumes only 3% of the CPU time, in contrast to the 9% of comparable Smalltalk-80
systems.

• Despite a five-to-one handicap in basic cycle time, the NMOS SOAR microprocessor
should run as fast as an ECL Dorado minicomputer.

The dissertation reports two results that run counter to conventional wisdom: that a
reduced instruction set computer can offer excellent performance for a system with dynamic
data typing such as Smalltalk-80, and that automatic storage reclamation need not be
time-consuming.
Chapter 1


Introduction 


Moons and Junes and Ferris wheels
the dizzy dancing way you feel.

As every fairy tale comes real
I've looked at SOAR that way. . .

I've looked at SOAR from both sides now,
from win and lose, and still somehow
It's SOAR's solutions I recall.

I really don't know SOAR, at all.

“Both  Sides  Now’’, 

(with  apologies  to)  Joni  Mitchell 


Computer hardware technology has improved dramatically in the past decade.
Computers now cost less, run faster, and have more space for programs and data. This advance in
hardware has created a demand for larger and more complex software. Unfortunately,
software productivity has not kept pace with hardware technology, leading to a “software
crisis.”

The Smalltalk-80 system provides an environment that fosters rapid program
development. The system itself was developed on a large, high-speed, $100,000 personal computer,
and most commercially available microprocessors, which are much more widely available,
cannot run it even half as fast. Regrettably, this lack of widely available high-performance
implementations has severely curtailed the system's acceptance.

It may be possible to surmount this obstacle with a reduced instruction set computer
(RISC) architecture. Such processors have demonstrated excellent cost-performance for
more conventional systems. However, RISCs have an architectural style that runs counter to
the conventional wisdom for exploratory programming environments, such as Smalltalk-80.
Instead of an instruction set that reflects the semantics of the source language, a RISC
instruction set reflects the demands of fast instruction decoding and execution.

We have investigated whether a reduced instruction set computer can provide good
performance for the Smalltalk-80 system. To this end, we have analyzed the architecture of,
and designed and analyzed the software algorithms for, a reduced instruction set
microcomputer system intended to run the Smalltalk-80 exploratory programming environment at full
speed. This system matches the performance of the fastest Smalltalk-80 implementations to
date (1986), yet runs at slower clock and memory speeds. The machine is called SOAR, for
Smalltalk On A RISC. Our colleagues have built two VLSI implementations of SOAR: an
NMOS chip (Figure 1.1), which has correctly run diagnostics, and a CMOS chip. In
addition, two Multibus™-compatible boards have been designed by others to host our chip in a
Sun 68010 workstation [BlD83, Bro84]. Our ultimate goal is to demonstrate SOAR in a
running Smalltalk-80 system.

We have also built Berkeley Smalltalk (BS) [UnP83], a Smalltalk interpreter for the
MC68010 that runs on the Sun workstation. It has served as a test bed for many of our ideas
and as a source of information about the time-consuming operations required to support the
Smalltalk-80 system.

SOAR  is  a  concoction  of  compiler  technology,  run-time  software,  architecture,  and 
VLSI  circuit  design.  This  dissertation  focuses  on  SOAR’s  architecture  and  run-time  support 
software:  what  SOAR  is,  how  it  was  designed,  and  why  it  works. 

Figure 1.1: NMOS SOAR chip. Courtesy of J. Pendleton and S. Kong.

• The next chapter describes the previous work in this area. It starts with a brief
description of some exploratory programming environments (EPEs), with particular emphasis
on the Smalltalk-80 EPE. It continues with a survey of architectures that supported
EPEs. Until SOAR, these systems pushed the source-level semantics into the
hardware, sacrificing either simplicity or performance. The last part of this chapter
covers previous reduced instruction set computers, which were all designed for
languages in the Algol family. SOAR is the first reduced instruction set architecture
for an exploratory programming environment.

• Chapter 3 enumerates the problems that Smalltalk-80 presents and the solutions in
SOAR's architecture. The effectiveness of each solution is represented by the time
cost of its omission, based on data gathered from simulations. Table 1.1 summarizes
these results.

• Chapter 4 casts a critical eye on SOAR's architecture. Simulation results show that a
400 ns SOAR will match the performance of a 70 ns ECL minicomputer. It will also
run at about the same speed as an MC68020 microprocessor with a 60 ns clock, 270 ns
memory, an on-chip instruction cache, and eight times more transistors than SOAR.
To understand SOAR's speed, its architectural features are listed in order of
effectiveness, from successes to failures. These results show that SOAR's language-specific
features approximately double performance.

• Chapter 5 delves into object-oriented storage management, a considerable source of
overhead and complexity for many Smalltalk-80 systems. For SOAR, we have devised
Generation Scavenging, a software algorithm that cuts automatic storage reclamation
overhead from 11% to 3%, reclaims circular structures, and provides an additional
20% performance improvement by eliminating a level of indirection. In addition to
virtually eliminating the time cost of garbage collection, this algorithm allows us to
remove object-oriented addressing from the architecture.

• Chapter 6 furnishes some proposals for coping with medium lifetime objects and an
analytical investigation of them.

• Finally, the concluding chapter presents the lessons we have learned from SOAR and
our recommendations for future designs.

• The appendices supplement the performance evaluation of SOAR's architecture:
Appendix A contains a detailed analysis of each feature's impact on speed and
memory size, and Appendix B gives our raw performance data.

Table 1.1: Smalltalk-80 performance challenges and the significance of SOAR features.

Smalltalk-80 performance challenge    SOAR feature                        Significance
Type Checking                         tagged integers                     26%
                                      two-tone instructions               16%
Interpretation                        compiling to RISC instructions      -100%
                                      byte insert/extract instructions    33%
Procedure Calls                       register windows                    46%
                                      in-line cache                       33%
                                      fast shuffle                        11%
Object-Oriented Storage Management    direct pointers                     20%
                                      generation scavenging               10%
The Cedar programming environment was also designed to enhance programming
productivity, but has taken a different tack from Smalltalk and Interlisp
[DeT80, Tei84, Tei83, SZH85, Rov84]. Smalltalk and Interlisp minimize the length of
programs and reduce the time to change and test them. This reduction in information from the
programmer, coupled with the elimination of a link-editing or binding phase, places many
demands on the execution of the program, which leads to the issues we address in this
dissertation. In contrast, the Cedar system relies on a strongly-typed language which makes
data types and module interfaces explicit. These features enhance the comprehensibility and
maintainability of large systems and allow the compiler to generate more efficient code. It
would seem that of the ideas presented herein, only the storage management algorithms
would be important with respect to an implementation of Cedar.

This research centers on one EPE in particular, the Smalltalk-80 system. Although
other EPEs share some of its features, we will henceforth concentrate on Smalltalk. Over a
decade ago, a small band of adventurers at Xerox PARC set out to explore how
computational resources could help people master the programming process. The Smalltalk-80
system [GoR83, Gol81, Gol84, Kra83] is their latest achievement. We have taken a simple
architecture and added a few features, resulting in a simple machine whose improved
cost-performance could make the Smalltalk-80 system available to many more people.

2.1.1.  Object-Oriented  Programming 

The  Smalltalk  systems  introduced  object-oriented  programming,  which  provides 
abstractions  for  structuring  programs  and  reduces  the  code  that  must  be  written. 
Object-oriented  programming  in  Smalltalk-80  has  three  important  aspects: 

• First, there are no type declarations in Smalltalk-80. Instead, information is kept at
run-time to resolve a variable's type. A variable may take on many different types.


• Second, a Smalltalk-80 procedure call uses the type of the first argument to choose its
target routine. The first parameter of every subroutine has an associated type, and the
subroutines are grouped accordingly. When a Smalltalk-80 system performs a call, it
finds the routine associated with the type of the call's first argument. As mentioned
above, the type is not known in advance, so this search must occur at run-time. This
overloaded call also makes it easier to reuse an old routine with a new type. When the
old routine uses the new type, operations defined on that type will be chosen at
run-time. It is not even necessary to recompile the old routine. In other words, new
types can be added gracefully to the system.

• Finally, types can be defined as extensions of other types. To define a new type that is
similar to an old one, the programmer can give the differences, and the new type will
inherit the format and functions from the old one.
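
The three aspects above (no declared types, a call that selects its target from the run-time type of its first argument, and types defined as extensions of other types) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any Smalltalk-80 implementation: the names `SmalltalkClass`, `Instance`, and `send`, and the `Point`/`LabeledPoint` example, are hypothetical and introduced here.

```python
class SmalltalkClass:
    """Hypothetical stand-in for a type: a table of routines plus a link to the type it extends."""

    def __init__(self, name, superclass=None):
        self.name = name
        self.superclass = superclass
        self.methods = {}


class Instance:
    """An object carries its type with it at run-time; variables carry no declarations."""

    def __init__(self, cls, **fields):
        self.cls = cls
        self.fields = fields


def send(receiver, selector, *args):
    """Choose the target routine from the run-time type of the first argument."""
    cls = receiver.cls
    while cls is not None:              # walk up through the types being extended
        if selector in cls.methods:
            return cls.methods[selector](receiver, *args)
        cls = cls.superclass
    raise LookupError(f"{receiver.cls.name} has no routine named {selector}")


# An old type with an old routine, "describe", that calls "kind".
point = SmalltalkClass("Point")
point.methods["kind"] = lambda self: "point"
point.methods["describe"] = lambda self: send(self, "kind") + " at " + str(self.fields["x"])

# A new type given only by its differences; "describe" is inherited unchanged.
labeled_point = SmalltalkClass("LabeledPoint", superclass=point)
labeled_point.methods["kind"] = lambda self: "labeled point"

old_result = send(Instance(point, x=3), "describe")
new_result = send(Instance(labeled_point, x=4), "describe")
```

In this sketch the old `describe` routine works with the new type without being touched, because its inner call to `kind` is resolved again at run-time.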

The Smalltalk-80 implementation has two more features that help its programmers.
For one thing, it runs on a computer dedicated to one user. Freedom from competing
demands lets the system provide uniform, fast response time in order to enhance
productivity. The other feature is automatic storage reclamation. Programmers of early
list-manipulation systems found it cumbersome to free unused storage explicitly. Instead,
they found ways to let the run-time support software reclaim unused storage automatically
[McC60, Col60]. Automatic reclamation provided a very important benefit: eliminating
errors caused by releasing storage too early. Despite its advantages, the high overhead
associated with automatic storage reclamation prevented widespread acceptance. This barrier
has been removed by faster algorithms.

2.1.2. Shortening the Edit-Compile-Test-Debug Cycle

In addition to reducing editing time, the Smalltalk-80 system reduces the time for the
compile, test, and debug phases of software construction. Conventional systems require a
lot of time to rebuild a large program after a change. The Smalltalk-80 system uses
incremental compilation and dynamic linking to integrate changes rapidly.

• Incremental compilation. To reduce the work needed to incorporate a small textual
change, a system must avoid recompiling the whole program. Information in symbol
tables or parse trees must be maintained and reused for the portion that did not change.
Most systems supply separate compilation on a module-by-module basis.
Recompilation frequently takes ten seconds to a minute. The Smalltalk-80 system provides a
much finer grain of incremental compilation and much shorter response times. Magpie
is a similar EPE for PASCAL [DMS84]. It compiles after every keystroke. In this
system, there is rarely a perceptible delay to rebuild a program.

• Dynamic linking. In a system that does all linking before execution starts, the
programmer must wait a while longer after recompiling a module while the system relinks
the module to the program's other modules. The result is that a simple change to a
large program takes a long time. In systems like Smalltalk-80, modules are not
statically bound together. Instead, they are connected as needed, dynamically. Dynamic
linking is essential to maintain short response time for changing large programs.

• Source-level debugging. Although most programmers construct their programs in a
high-level language, early systems forced them to debug their programs in terms of
machine instructions and machine data types. Modern systems make debugging easier
by presenting breakpoints, errors, and variables in terms of the HLL source code
instead of the object code. For instance, they show where execution is suspended in
the source code and can execute a line at a time. In such systems, the programmer can
debug much faster because he has less work to do. EPEs go even further. When
debugging, the programmer can try the effect of a new statement by merely typing it
in. The Smalltalk-80 system will instantly compile and execute the statement in the
context of the suspended program. When the error is located, it can be corrected
without terminating the suspended program. It can be restarted, or single-stepped from
the point of the error. With a system like Smalltalk-80, one can debug a program into
existence.
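
The three mechanisms in the list above can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the Smalltalk-80 mechanism: the table `methods` and the helpers `define`, `call`, and `evaluate_in_context` are hypothetical names introduced here, and Python's built-in `compile`, `exec`, and `eval` stand in for the system's compiler.

```python
methods = {}  # routine name -> compiled function (a hypothetical method table)


def call(name, *args):
    """Dynamic linking: the target is found by name at the moment of the call."""
    return methods[name](*args)


def define(name, source):
    """Incremental compilation: compile one routine and install it; nothing else is rebuilt."""
    namespace = {"call": call}  # compiled routines reach each other only through call()
    exec(compile(source, f"<{name}>", "exec"), namespace)
    methods[name] = namespace[name]


def evaluate_in_context(expression, suspended_variables):
    """Debugger aid: compile a freshly typed expression and run it against a suspended routine's variables."""
    code = compile(expression, "<debugger>", "eval")
    return eval(code, {"call": call}, dict(suspended_variables))


define("area", "def area(w, h):\n    return w + h\n")  # deliberately wrong
define("report", "def report(w, h):\n    return 'area=' + str(call('area', w, h))\n")
before = call("report", 2, 3)

# While "debugging", try a candidate expression using the suspended routine's variables.
probe = evaluate_in_context("w * h", {"w": 2, "h": 3})

# Correct only the faulty routine; "report" is neither recompiled nor relinked.
define("area", "def area(w, h):\n    return w * h\n")
after = call("report", 2, 3)
```

Because `report` finds `area` through the table on every call, replacing one entry is enough for the correction to take effect.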

The Smalltalk-80 system represents a compromise between compiled and interpreted
systems. Programmers can produce more software when they can incorporate and test
changes faster and when they can take advantage of a powerful debugger. Most such
systems are interpreters, saving much state and interpreting it at run-time. Of course, the extra
work involved imposes severe performance penalties. To run the fastest, a program must do
the least work; compilers attempt to determine as much as possible about a program's
behavior statically, leaving a minimum of work for run-time. The Smalltalk-80 system is a
happy medium. Enough information is compiled out to make good performance possible,
but enough is left in to make it easier to program.

2.1.3.  Graphics 

The Smalltalk-80 system takes advantage of bitmap display hardware and pointing
devices to support multiple windows, selecting by pointing, pop-up menus, and even diagrams of
program structure [ShM83]. This follows the adage that “A picture is worth a thousand
words.”

2.1.4.  Rapid  Response 

High  productivity  demands  consistent,  split-second  response  time  [Tha81].  So,  most 
EPEs  we  know  of  use  dedicated  personal,  high-performance  minicomputers. 

2.1.5. The Bad News

Why do exploratory computing environments remain largely experimental? They
suffer from poor cost-performance. For example, each of the EPEs in Table 2.1 requires a
powerful and costly minicomputer for each programmer. The research in this dissertation is
an attempt to reduce the hardware cost for the Smalltalk-80 exploratory programming
environment.


2.2. The Smalltalk-80 Exploratory Programming Environment

In 1972 Alan Kay started a group at Xerox PARC to explore how computational
resources could help people master the programming process. The Smalltalk-80 system
[GoR83, Gol81, Gol84, Kra83] is the culmination of their efforts. A dedicated, powerful
personal computer hosts this innovative system. Multiple on-screen windows, pop-up menus,
and pointing distinguish Smalltalk-80's user interface from older systems. The Smalltalk-80
language has replaced operating on variables with sending messages to objects, and its
run-time system automatically reclaims storage and finds space to allocate new objects.

Smalltalk-80's greatest strengths and its worst weaknesses result from the same design
decision: dynamic binding of types to variables and subroutines to call instructions.
Smalltalk-80's designers have eliminated type declarations from the language, thereby
making it easier to write and modify programs.

On the other hand, computing a variable's type or a call's destination on-the-fly slows
down the system, or increases the cost for a machine with adequate performance. The only
computer that has demonstrated universally acceptable Smalltalk-80 performance is the
Xerox Dorado [LPM81, Pie83, Deu83a]. This 70 ns ECL minicomputer costs $120,000 (in
1985) and dissipates over 2 kilowatts, requiring an air-conditioned room. Smalltalk-80
systems that run on more conventional, cheaper computers, including our own Berkeley
Smalltalk, suffer lackluster performance. For example, Table 2.2 shows the performance of
the official Smalltalk-80 compiler benchmark for several implementations, including a
simulation of our machine. (See Section 4.1 for a description of the benchmarks.)

Table 2.1: Some exploratory programming environments.

Environment     Language        Developed at    Host CPU
InterLisp-D     InterLisp       Xerox PARC      Dorado
Cedar           Cedar-Mesa      Xerox PARC      Dorado
Smalltalk-80    Smalltalk-80    Xerox PARC      Dorado
Lisp Machine    ZetaLisp        Symbolics       Symbolics 3600


2.3. Reducing the Cost of EPEs with Software Only

How can we make Exploratory Programming Environments more cost effective and
more generally available? One way is with clever software on a cheap, conventional
machine. L. Peter Deutsch and Alan Schiffman have built such a Smalltalk-80 system for a
10 MHz Motorola 68010 microprocessor [DeS84], a conventional (and successful) general
purpose microprocessor. The 68010's microcoded control unit implements a 32-bit,
register-based instruction set that runs at memory speed. Jumps pay a penalty to refill the
instruction pipeline, and calls must contend with register saving and restoring overhead. A
large flat address space helps support systems like Smalltalk and Lisp that require large,
single address spaces.


Although the fastest 68010 instruction is 6 times slower than a Dorado
microinstruction, the Deutsch-Schiffman system runs Smalltalk-80 only three times slower.*

Table 2.2: Performance of Smalltalk-80 Compiler Benchmark.

Machine                           Dorado     Dolphin    VAX-11/780    68010      SOAR
                                  (Xerox)    (Xerox)    (DEC)         (Xerox)    (UCB)
Year of introduction              1978       1978       1978          1984       1985
Technology                        ECL        TTL        TTL           NMOS       NMOS
Cycle time                        67 ns      180 ns     200 ns        400 ns     400 ns
Virtual machine implementation    microcode; assembler
Object pointer size               16 bits; 32 bits
Relative performance              (100%)     11%        8%            40%        103%
(Dorado = 100%, larger is faster)

* The system has now been ported to the MC68020, in a SUN 3 workstation. This processor runs at 16.67 MHz, with
wait states (SSSS3). The fastest possible instruction runs in three clock cycles, or 180 ns. The memory system can deliver a
32-bit word in 270 ns. So, the cycle time for a simple instruction would seem to range from 180 ns to 270 ns, depending on
whether the instruction is cached. On this machine, the Xerox 68000 Smalltalk system can execute the compiler benchmark
SOU as fast as a Dorado.

The efficiency improvement over the Dorado arises from the following software techniques:

• Dynamic translation. Instead of being interpreted, Smalltalk-80 subroutines are
translated into 68010 instructions when first called. The translated versions are
directly executed and then cached for later use.

• In-line caching. Each procedure call requires a table lookup to find its target
subroutine. Even though a call could invoke many possible targets, there is a simple way to
predict the target of any given call. 95% of the time, a call will invoke the same
routine it did the last time [DAmb83]. Thus, after performing a lookup for a call
instruction, the Deutsch-Schiffman system overwrites the call to the lookup routine with a
call to the target routine. The next time the call is executed, control bypasses the
lookup routine and goes directly to the previous target. Of course, the other 5% of the
time, the target has changed. So, each subroutine starts with a check to cause another
lookup if necessary. In this manner, the targets for subroutine calls are cached in the
instruction stream, eliminating costly lookups.

• Volatile contexts. The Smalltalk-80 language specifies that its activation records can
be manipulated like any other objects in the system. Although this simplifies the
debugger, it creates more work for calls and returns and thus hurts system
performance. For example, when saving the program counter, a call must first convert it
from a pointer into a tagged integer offset. Deutsch and Schiffman have minimized the
overhead by providing multiple representations for activation records and automatic
conversion between them. In this manner, they defer expensive conversions as long as
possible. Since very few activation records are ever examined by the debugger, most
of these conversions are never performed at all, significantly reducing subroutine call
overhead.


• Deutsch-Bobrow deferred reference-counting. In addition to activation records, a
Smalltalk-80 system allocates a new object every 80 instructions on average [Ung84].
This heavy burden can make automatic storage reclamation a system bottleneck. In
this system, Deutsch-Bobrow deferred reference-counting [DeB76] reduces storage
reclamation overhead to 9% of the total CPU time.
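
The in-line caching technique from the list above can be sketched as follows. This is a simplified model under stated assumptions, not the Deutsch-Schiffman code: the real system overwrites a call instruction and places the check at the start of each subroutine, whereas this sketch keeps the cached target and the check together in a hypothetical `CallSite` object. The `lookup` helper and the `Square` and `Rect` classes are likewise illustrative.

```python
def lookup(cls, selector):
    """Full table lookup: search the receiver's class, then the classes it extends."""
    for klass in cls.__mro__:
        if selector in vars(klass):
            return vars(klass)[selector]
    raise AttributeError(selector)


class CallSite:
    """One call instruction, remembering the target it reached last time."""

    def __init__(self, selector):
        self.selector = selector
        self.cached_class = None
        self.cached_target = None
        self.full_lookups = 0  # counts how often the slow path runs

    def send(self, receiver, *args):
        cls = type(receiver)
        if cls is not self.cached_class:       # the check that guards the cached target
            self.cached_target = lookup(cls, self.selector)
            self.cached_class = cls            # "overwrite the call" with the new target
            self.full_lookups += 1
        return self.cached_target(receiver, *args)


class Square:
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side


class Rect:
    def __init__(self, width, height):
        self.width, self.height = width, height

    def area(self):
        return self.width * self.height


site = CallSite("area")
results = [site.send(shape) for shape in (Square(2), Square(3), Rect(2, 5), Square(4))]
```

With this sequence of receivers the second call is served from the cache, while the first, third, and fourth calls each take the slow path because the receiver's class differs from the cached one.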

2.4.  Hardware  for  Exploratory  Programming  Environments 

In addition to innovative software, special-purpose hardware may further reduce the
cost of an EPE. In the past, researchers have closely coupled the source language semantics
to the hardware-supported operations and data types. Although memory-efficient, this
approach has usually resulted in increased cost and poor performance. This section
examines five computers: the RICE computer, which introduced tags; the Burroughs 5700,
Scheme-79, and Symbolics 3600 machines, designed for specific high level languages; and
the Katana-32, another microprocessor for the Smalltalk-80 system.

2.4.1.  The  RICE  Computer 

The R-2 computer developed at Rice University was a tagged architecture with
subscript address calculation and bounds-checking hardware [Feu72]:

•  A  wide,  62-bit  word  size  allowed  an  array’s  length  and  initial  index  to  accompany  its 
base  address. 

•  A  rich  variety  of  numeric  types,  control  words,  and  address  words  were  encoded  in  the 
R-2’s  four  tag  bits.  (See  Table  2.3.) 

The R-2 design simplified its compilers, provided a measure of protection for the operating
system, and reduced the amount of data needed by the debugger. Although it did not
maximize speed, this design fostered sharing among many users in a common address space. To
our knowledge, the RICE computer was the first to add tags to data.


Figure 2.1: R-2 address word format. The length and index of the first element accompany
the base address. (Labeled fields: length, present in core, initial index, indirect tags,
restricted access, direct tags, software tags (trace bits), write lockout.)


Table 2.3: R-2 tags.

Tag     Meaning
0000    mixed or untagged
0001    (unassigned)
0010    (unassigned)
0011    (unassigned)
0100    real, single precision
0101    54-bit binary string or integer
0110    double precision
0111    complex
1000    undefined for normal operations
1001    partition word
1010    relative control word
1011    absolute control word
1100    relative address, unchained
1101    absolute address, unchained
1110    relative address, chained
1111    absolute address, chained
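
The tag assignments listed above are regular enough to decode mechanically. The following Python sketch is an illustration only: the dictionary `TAG_MEANINGS` transcribes the listing, while the function names and the reading of the two low-order bits of an address tag (absolute versus relative, chained versus unchained) are inferences drawn from the listing, not a statement of how the R-2 hardware decoded its tags.

```python
TAG_MEANINGS = {
    0b0000: "mixed or untagged",
    0b0001: "(unassigned)",
    0b0010: "(unassigned)",
    0b0011: "(unassigned)",
    0b0100: "real, single precision",
    0b0101: "54-bit binary string or integer",
    0b0110: "double precision",
    0b0111: "complex",
    0b1000: "undefined for normal operations",
    0b1001: "partition word",
    0b1010: "relative control word",
    0b1011: "absolute control word",
    0b1100: "relative address, unchained",
    0b1101: "absolute address, unchained",
    0b1110: "relative address, chained",
    0b1111: "absolute address, chained",
}


def describe_tag(tag):
    """Return the listed meaning of a four-bit tag."""
    if not 0 <= tag <= 0b1111:
        raise ValueError("a tag has exactly four bits")
    return TAG_MEANINGS[tag]


def decode_address_tag(tag):
    """Split an address tag (11xx) into the two properties its low bits appear to encode."""
    if not 0 <= tag <= 0b1111 or tag >> 2 != 0b11:
        raise ValueError("not an address tag")
    return {"absolute": bool(tag & 0b01), "chained": bool(tag & 0b10)}


example = decode_address_tag(0b1110)
```

For instance, `decode_address_tag(0b1110)` reports a relative, chained address, in agreement with the listing.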

2.4.2. The Burroughs B5700 and B6700 Computers

In the sixties and early seventies, the Burroughs Corporation introduced the first
commercial computers dedicated to a high-level language, their 5000 and 6000 series [Org73].
A tagged, stack-oriented architecture was chosen to host an Algol superset. Memory was at
a premium in those days, and its segmented virtual memory system enabled the B5700 to
operate with only 32,000 words of main memory. Paradoxically, adding 3 tag bits to each
45-bit memory word saved memory by reducing the number of words needed. For example,
tags on data reduced the size of instructions by permitting a single add opcode to serve all
types of numbers. Tags also helped with managing the stack and accessing data structures.
Table 2.4 illustrates the 6700's data formats. A substantial quantity of hardware in these
machines was devoted to supporting stack-based, block structured computation. The 5700
and 6700 proved that commercial computers could be designed for a high level language.
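
The remark that tags let a single add opcode serve all types of numbers can be made concrete with a small Python sketch. It is a loose illustration under stated assumptions, not Burroughs hardware behavior: the tag values for single- and double-precision numbers follow Table 2.4, but the `Word` class, the `add` function, and the rule that a mixed sum is double precision are hypothetical simplifications.

```python
from dataclasses import dataclass

SINGLE = 0b000  # single-precision number (tag value from Table 2.4)
DOUBLE = 0b010  # double-precision number (tag value from Table 2.4)
NUMERIC_TAGS = (SINGLE, DOUBLE)


@dataclass(frozen=True)
class Word:
    """A memory word modeled as a tag plus a value (the real word layout is not modeled)."""
    tag: int
    value: float


def add(a, b):
    """One add operation for every numeric type: the operand tags select the behavior."""
    for operand in (a, b):
        if operand.tag not in NUMERIC_TAGS:
            raise TypeError(f"tag {operand.tag:03b} is not a number")
    # Assumed rule for this sketch: any double-precision operand makes the sum double.
    result_tag = DOUBLE if DOUBLE in (a.tag, b.tag) else SINGLE
    return Word(result_tag, a.value + b.value)


single_sum = add(Word(SINGLE, 1.5), Word(SINGLE, 2.0))
mixed_sum = add(Word(SINGLE, 1.0), Word(DOUBLE, 2.25))
```

Because the type travels with the data, the instruction stream needs only one add opcode; an untagged design would need a separate opcode, and hence more instruction bits, for each numeric type.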


2.4J.  Scheme-79 

Scheme-79,  an  early  high-level  language  microprocessor,  directly  executed  a  dialect 
of  Lisp  [SHJ81]. 

•  Each  32-bit  word  contained  one  bit  to  aid  garbage  collection,  seven  bits  of  type  and 
opcode  information,  and  a  24-bit  pointer.  (See  Figure  2.2.) 

•  An innovative and interesting design, Scheme-79 pushed Lisp abstractions to a low
level to attain the power of interpreted execution at lower cost. For example, many
opcodes were needed to maintain the correspondence with source-level Lisp


Table 2.4: Burroughs 6700 data formats.

Class of Operand    Type of Word                      Tag
numbers             single-precision                  000
                    double-precision (2 words)        010
descriptor words    segment                           011
                    data                              101
control words       indirect reference word           001
                    stuffed indirect reference word   001
                    mark stack control word           011
                    return control word               011
                    top-of-stack control word         011
                    program control word              111

Figure 2.2: Scheme-79 data format. Two of these words make up a list node. (Recovered
field labels: GC, type.)

primitives. (See Table 2.5.) As a result, microcode, microsubroutines, and nanocode
were used to fit the control circuitry on-chip. Scheme-79 had good performance compared
to other interpreters, but not when compared to compiled Lisp. This is shown in
Table 2.6, from [Pon83a]. These data suggest that a machine that is specialized for a
particular system must also exploit compilation to attain high performance.


Instead of a linear sequence of instructions, Scheme-79 used a Lisp binary tree for
program control, each node consisting of two words. The first word was the instruction
and the second was a pointer to the next instruction. The instruction format is the same


Table 2.5: Some Scheme-79 opcodes.

APPLY
CAR
CDR
CLOSURE
COND
CONS
EQ
FIRST-ARG
GLOBAL
LIST
LOCAL
NIL
PROCEDURE
SEQUENCE


Table 2.6: Performance of the Scheme benchmark.

VAX 11/780 Franz interpreter                    2 min
Scheme chip (projected)                         1 min
VAX 11/780 Franz, compiled (normal funcall)     8.7 sec
VAX 11/780 Franz, compiled (local funcall)      3 sec




as the data format shown above. This non-sequential format prohibits instruction
prefetching and so reduces the speed of macro-instructions.

•  All data, including the stack contents, were kept in memory as lists. In addition to the
memory reference overhead, this approach wasted time to reclaim list space for
temporary values. Even with a microcoded link-reversal mark-and-sweep garbage
collector [ScW67, Sta80], Sussman estimated that Scheme would spend 80% of its time in
the storage allocator.

The Scheme-79 chip was fabricated in the MPC-79 Multi-University Multiproject
Chip-Set at λ = 2.5 μ (5 micron line width). It was 7500 μ long and 5900 μ wide. One of
the fabricated chips ran small programs and reclaimed storage. Fibonacci(20) took 100
million cycles (@ 1600 ns) with a 64KW memory that was half-full. Over two-thirds of those
cycles were spent collecting garbage. Scheme-81 is a successor to Scheme-79 with more
aggressive silicon technology (λ = 1.5, 12,000 μ w x 12,000 μ h) [BGH82]. Its designers
estimate Scheme-81 would run five times faster than Scheme-79. This would still run the
Scheme benchmark more slowly than compiled Franz Lisp on a VAX 11/780.
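
The Scheme-79 word layout described above (one garbage-collection bit, seven bits of
type and opcode information, and a 24-bit pointer) can be illustrated with a small
packing routine. This is a sketch under stated assumptions: the text gives only the
field widths, so the bit positions chosen here (GC bit at the top, pointer at the
bottom) and the names `pack_word` and `unpack_word` are hypothetical.

```python
# Field widths come from the text; the bit positions are an assumption.
GC_SHIFT = 31          # assumed: garbage-collection bit is the top bit
TYPE_SHIFT = 24        # assumed: 7 type/opcode bits sit just below it
TYPE_MASK = 0x7F       # seven bits
POINTER_MASK = 0xFFFFFF  # 24 bits


def pack_word(gc_bit, type_code, pointer):
    """Pack the three fields into one 32-bit word."""
    if gc_bit not in (0, 1):
        raise ValueError("gc_bit must be 0 or 1")
    if not 0 <= type_code <= TYPE_MASK:
        raise ValueError("type_code must fit in 7 bits")
    if not 0 <= pointer <= POINTER_MASK:
        raise ValueError("pointer must fit in 24 bits")
    return (gc_bit << GC_SHIFT) | (type_code << TYPE_SHIFT) | pointer


def unpack_word(word):
    """Split a 32-bit word back into (gc_bit, type_code, pointer)."""
    return (
        (word >> GC_SHIFT) & 1,
        (word >> TYPE_SHIFT) & TYPE_MASK,
        word & POINTER_MASK,
    )
```

A 24-bit pointer addresses 16M words, so the type and GC fields cost address range
rather than extra memory width.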


[Figure: word format. Recovered field labels: datatype, CDR code, immediate number.]


2.4.4.  The  Symbolics  3600  Lisp  Machine 


The Symbolics 3600 is a TTL personal minicomputer for Lisp [Roa83, Moo85]. It has
good performance, substantial complexity, and high cost: $80,000 for each programmer.

•  Each  word  contains  36  bits:  a  two  bit  field  for  list  compression  (CDR-coding),  a  type 
field  of  two  bits  for  numbers  or  six  bits  for  pointers,  and  either  a  32-bit  data  field  or  a 
28-bit  pointer  field.  This  provides  a  rich  selection  of  hardware-supported  types.  Table 
2.7  lists  some  of  the  34  types  implemented  by  the  3600’s  hardware  and  firmware. 

•  Each  3600  instruction  is  17  bits  long,  with  nine  bits  of  opcode  and  eight  for  the 
operand/address.  There  are  seven  instruction  formats.  Table  2.8  gives  a  sampling  of 
the  opcodes. 

•  Some of the 3600’s instructions perform complex operations. Instructions such as
multiply, divide, and store-array-leader may take many cycles to complete. These
instructions must also handle many different data-types. These factors combine to
require almost a million bits of control store, about twice that of a VAX-11/780.

•  Tags in the 3600 minimize the cost of dynamic typing. In conventional systems, a
datum’s type must be determined before it is used. A 3600 instruction assumes a


Table 2.7: Some Symbolics 3600 data types.

ARRAY
BIGNUM
CLOSURE
COMPILED CODE
COMPLEX NUMBER
COROUTINE
EXTENDED FLOATING POINT NUMBER
FLAVOR-INSTANCE
FLOAT
LEXICAL CLOSURE
LIST
NIL
RATIONAL NUMBER
SYMBOL


Table 2.8: Some 3600 opcodes.

Category                          Examples
Data movement                     push-immed, pop-n-save, movem-local
Instance variable                 push-instance-variable, movem-instance-variable,
                                  instance-ref
Function calling                  call-0-stack, call-n-return, funcall-1-stack
Binding and function entry        take-n-args, take-n-optional-args-rest
Function return                   return-stack, return-multiple
Quick function call and return    popj
Branch
Catch                             catch-open-stack, unwind-protect-open
Predicates                        eq, not, fixp, floatp, symbolp, arrayp
Arithmetic                        add-stack, subtract-stack, multiply-stack,
                                  quotient-stack, remainder-stack, rot-stack
List and symbol                   car, cdr, rplaca, set, symeval,
                                  property-cell-location, package-cell-location
Array                             array-leader, store-array-leader
Subprimitive                      halt, %multiply-double, %data-type, %pointer,
                                  %stack-group-switch, %gc-tag-read


likely  type  and  proceeds,  while  simultaneously  verifying  that  assumption  against  the 
tag.  If  the  assumption  is  false,  the  3600  aborts  the  current  microcode  sequence  and 
starts  executing  microcode  for  the  required  operation.  This  saves  time  for  operations 
on  the  most  common  types. 

•  An area-based automatic storage reclamation algorithm reclaims space by incrementally
copying surviving objects. The Symbolics machine has paged virtual memory
and its paging hardware aids storage reclamation by recording which pages of
permanent objects contain references to temporary objects. Area-based copying
reclamation is very efficient. (See the chapter on automatic storage reclamation.)

•  The 3600’s microcycle time varies between 180 and 250 ns, making it one of the
fastest commercially available personal computers for an exploratory programming
environment [Pon83b].

Although providing good performance, the 3600’s $80,000 price tag reflects the cost of
seeking hardware solutions to system problems.

2.4.5. Katana-32

Midway through the SOAR project we learned of the Katana-32, also known as
Sword-32, an independent attempt by a group of researchers at Tokyo University to build a
fast VLSI Smalltalk-80 microcomputer [SKA84, Suz84]. Unlike our RISC approach, they
have continued with the traditional complex instruction set (CISC) style of computer
architecture. Table 2.9 compares the Katana and SOAR designs. Katana's large microstore,
variable-length bytecoded instructions, and 160 registers suggest that it is basically a
Dorado on a chip. Table 2.10 shows the benchmark used for their performance predictions,
with Table 2.11 showing the resulting object code for both machines.

The designers of Katana-32 are relying on aggressive VLSI technology for their
performance projections. Their chip will have five times more transistors than SOAR, and have


Table 2.9: Comparison of SOAR and Katana-32.


SOAR    Katana-32


architecture 
number  of  instructions 
instruction  formats 
instruction  length 
data  path  width 
microstore 
registers 
cycle  time 

number  of  transistors 


testActivationReturn micro-benchmark*


71  bytes  :  21  bytes 


bytecode  interpreter 

-46 

-9 

1  -3  bytes 
32  bits 

4Kw  x  45  bits 


Table 2.10: The ActivationReturn benchmark.

Smalltalk-80                          C
recur: tl                             recur(tl) {
    tl = 0 ifTrue: [^self].               if (tl == 0) return;
    self recur: tl - 1.                   recur(tl - 1);
    ^self recur: tl - 1                   recur(tl - 1);
                                      }


* This one micro-benchmark is not a fair comparison. However, as far as we know, it is the
only Katana performance figure available.

‡ 12.5 with a better compiler.

† 510 ns is the measured cycle time of working NMOS SOAR chips, including 110 ns for the
unexpected jump and call delay [Pen85b, Pro85a]. (See Section 5.4.3.) 125 ns is the
projected cycle time for Katana [Suz84].




Table 2.11: testActivationReturn object code.

SOAR Machine Code                                     cycles
%loadc  (r_receiver)classOffset, r6                   2
%load   (r_returnAddress)0, r5                        2
%trap1  ne r5, r6       /* cache miss */              1
skip    eq r_tl, 0                                    1-2
jump†   -+2f                                          1†
reow    r_returnAddress, 1                            2
sub     r_tl, 1, r6                                   1
%add†   r6, 0, r5       /* synthesized move */        1†
%add    r_self, 0, r6   /* synthesized move */        1
call    recur                                         1
<selector>
sub     r_tl, 1, r6                                   1
%add†   r6, 0, r5       /* synthesized move */        1†
%add    r_self, 0, r6   /* synthesized move */        1
call    recur                                         1
%add    r6, 0, r_retVal                               1
%trap2  geu r_retVal, CONTEXT_TAG                     1
remw    r_returnAddress, 1                            2

length      72 bytes
min time    9 cycles
max time    19 cycles
average     14 cycles

Katana-32 Machine Code [SKA84, Suz84]    cycles
pushTemp: 0                              3
pushConstant: 0                          2
send: =                                  3
jumpFalse: 10                            3-6
returnSelf                               4
pushSelf                                 2
pushTemp: 0                              3
pushConstant: 1                          2
send: -                                  4
send: recur:                             21
pop                                      1
pushSelf                                 2
pushTemp: 0                              3
pushConstant: 1                          2
send: -                                  4
send: recur:                             21
returnTop                                4

length          21 bytes
min time        13 cycles
max time        83 cycles
average time    49 cycles

† These instructions could be eliminated by a better compiler.
twice as many registers on the datapath, yet a cycle will take only one third the time. We
believe that SOAR could also run considerably faster if implemented in that technology.

2.5. Reduced Instruction Set Computer (RISC) Architecture

The machines described above are more elaborate and expensive than conventional
computers. We need a machine that has high performance at low cost. One recent style of
computer architecture, the reduced instruction set computer (RISC), claims to meet those
demands for traditional programming systems [PaD80, PaS81, PaS82]. In this style there is
a much closer coupling between architecture and implementation.

To  design  a  RISC, 

•  start  with  a  fast  and  simple  register-based  instruction  set  similar  to  microcode  in  other 
machines,  then 

•  identify the time-consuming operations in typical programs, and finally

•  take the hardware saved by simplifying instruction execution and dedicate it to speeding
up the time-consuming operations.

RISC designs contrast with traditional high-level language computers that rely on long
microcode sequences to provide complex functions “in hardware.” Instead of microcode,
RISC systems rely on software to provide complicated operations. Of course, software
consumes memory, but we would gladly add memory to gain speed. The rest of this section
touches on several important RISCs: IBM's 801, Berkeley's RISC I and II, and Stanford’s
MIPS. These reduced instruction set computers all point in the same direction: more
performance with less hardware.


2.5.1.  IBM-801 


The IBM-801 computer pioneered many RISC concepts [Rad82], including a simple
load/store instruction set and the coupling of architecture design with compiler technology.
A sophisticated graph-coloring algorithm enabled its compiler to optimize register allocation
over a fairly small register file [Cha82]. Constructed in ECL, the 801 attained excellent
performance. Although this work was not published immediately, it pioneered the benefits of a
reduced instruction set.

2.5.2. RISC I and II

The RISC I and II microprocessor chips were designed and built at Berkeley to yield
high performance for the C/Unix environment [KSP83]. Figures 2.4 and 2.5 are
photographs of the RISC I and II, respectively.

•  True to their names, these reduced instruction set computers have about two dozen
instructions in their instruction sets, and are distinguished by the simplicity and
compactness of their control circuitry: 5% to 10% of chip area. This contrasts with 50%
for more typical designs. The minimal and simple control circuitry shortens the design
time as well as instruction cycle time.

•  These systems were designed for existing compiler technology. In this technology,
subroutine calls are slow because they save and restore registers. RISC I and II speed
up subroutine calls with hardware that eliminates this source of overhead. To
accomplish this, they spend the area saved by simplifying the control circuitry on a large
on-chip register file, organized as overlapping windows.

In addition to providing good performance, reduced instruction set computers are easier to
design. RISC I met the goal of functional correctness on first silicon, and RISC II ran at full
speed on first silicon, outperforming superminicomputers using the same compiler
technology. A more complex architecture would have jeopardized these goals.

Figure 2.4: Microphotograph of RISC I.

Figure 2.5: Microphotograph of RISC II. Only 5% of the chip (the upper right corner)
is dedicated to control.


2.5.3. MIPS


MIPS stands for Microprocessor without Interlocked Pipelined Stages
[HJP83, HJB82]. It refines reduced instruction set architecture by eliminating pipeline
interlock hardware. Instead, the MIPS project has developed effective algorithms to schedule
instructions for the pipeline statically. The results are promising:

•  Instruction dependencies are handled with a one-stage delayed branch. (The
instruction following a branch is always executed.) The MIPS reorganizer fills 70% of the
slots after delayed branch instructions. Since these branches account for 20% of all
instructions, and since MIPS has one delay slot per branch instruction, there are 20
delay slots for every 100 instructions. Filling 70% of them leaves only 6 wasted slots
per 100 instructions, which is only 6% slower than the (probably unrealizable)
optimum.

•  Data dependencies are also handled by reordering instructions. The performance of
code generated this way is within 3% of the code that could be run with hardware
pipeline interlocks.

•  Another  finding  of  the  MIPS  project  is  that  a  word-addressed  machine  can  run  most 
programs  faster  than  one  with  byte  addressing.  The  problem  with  byte  addressing  is 
that  the  extra  circuitry  required  can  slow  down  word  references. 

•  MIPS demonstrates impressive performance: a simulated MIPS CPU with a 4 MHz
clock runs benchmarks about five times faster than an 8 MHz 68010.

The MIPS project blends simpler control circuitry with more sophisticated optimizing
compiler technology to achieve more performance with less hardware.
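
The delay-slot arithmetic quoted in the first item above can be checked with a short
calculation. This is only a worked restatement of the figures in the text (20 branches
per 100 instructions, one delay slot per branch, 70% of slots filled); the function name
`wasted_slots_per_100` and its parameter names are made up for the example.

```python
def wasted_slots_per_100(branches_per_100, slots_per_branch, fill_percent):
    """Delay slots left unfilled (wasted) per 100 instructions."""
    delay_slots = branches_per_100 * slots_per_branch
    return delay_slots * (100 - fill_percent) / 100


# Figures from the text: 20% branches, one slot each, 70% of the slots filled.
wasted = wasted_slots_per_100(branches_per_100=20, slots_per_branch=1, fill_percent=70)
print(wasted)
```

With these inputs the result is 6 wasted slots per 100 instructions, matching the 6%
slowdown relative to the optimum cited above.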

2.6. Summary

The Smalltalk-80 system provides a programming environment that boosts a
programmer’s productivity. It does so by exploiting the object metaphor to shorten the
edit-compile-test-debug cycle. However, Smalltalk-80, along with other exploratory
programming environments, runs slowly on conventional hardware.

We have designed a reduced instruction set computer, and added features to it to
support Smalltalk. In doing so, we have followed in the footsteps of other architecture projects:

•  The  RICE  computer  pioneered  tags,  as  a  means  to  control  data  manipulations. 

•  The Burroughs B5700 and B6700 computers supported Algol with tagged data,
descriptors, and a tailored instruction set.

•  Scheme-79 was the first attempt to marry Mead-Conway VLSI design with an
interpretive language.

•  The Symbolics 3600 Lisp Machine is a commercially successful computer dedicated to
a specific exploratory programming environment.

•  IBM-801 revived interest in simple computers and highly optimizing compilers for
non-floating point applications.

•  RISC I and II at Berkeley taught us much about instruction sets, register windows, and
data path design.

•  The MIPS machine at Stanford encouraged us to forgo byte addressing.

SOAR combines a simple RISC architecture with enough tagging to support the
common cases. In the following chapters, we describe SOAR’s architecture, assess the worth of
each architectural feature, explain important algorithms in its system software, and propose
designs for future systems.

Chapter 3


The  SOAR  Architecture 

3.1.  Introduction 

This chapter describes the SOAR architecture, contrasting SOAR with its predecessor,
RISC II. Most innovations in SOAR compensate for sources of overhead in Smalltalk-80
systems: run-time type checking, virtual machine interpretation, elaborate and frequent
procedure calls, and maintaining many small, dynamic data structures. We conclude with an
overview of the implementation, detailed in Pendleton’s doctoral dissertation [Pen85b]. A
summary of this chapter has been previously published [UBF84]. A more detailed
architectural description appears in [SKF85].

Two figures-of-merit accompany each feature: execution time and memory space. We
gauge a feature’s significance by examining what would happen if we left it out. Thus an
omission time cost of 50% means that a job requiring 100 cycles on full SOAR would take
100 + 50, or 150 cycles without the feature. Likewise an omission space cost of 33%
indicates that the whole Smalltalk-80 system would grow by 33%, from 1.5 MB to 2.0 MB.
With these metrics, we can find the combined impact of removing two independent features
simply by adding the omission costs for each. These data are the results of simulations and
assume no radical compiler changes. (The derivation of the numbers is explained in the next
chapter and in Appendix A.)
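
The omission-cost metric defined above amounts to simple percentage arithmetic, which
the following sketch restates. The function names are hypothetical, and pairing the 26%
and 16% figures in the last example is purely illustrative: the text states the additive
rule only for independent features.

```python
def cost_without_feature(full_soar_amount, omission_cost_percent):
    """Cycles (or bytes) needed once a feature with the given omission cost is removed."""
    return full_soar_amount * (100 + omission_cost_percent) / 100


def combined_omission_cost(*omission_costs_percent):
    """Additive rule from the text; it applies to independent features only."""
    return sum(omission_costs_percent)


# Worked examples from the text.
cycles = cost_without_feature(100, 50)      # 100 cycles with a 50% omission time cost
megabytes = cost_without_feature(1.5, 33)   # 1.5 MB with a 33% omission space cost

# Illustrative only: two omission time costs, assumed independent.
both = combined_omission_cost(26, 16)
```

Here `cycles` is 150 and `megabytes` is about 2.0, as in the text; `both` is 42, the
percentage slowdown the additive rule would predict for that hypothetical pairing.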

3.2. Type Checking

The FORTRAN statement “I = J + K” denotes integer addition, and can be performed
with a single add instruction. But, since Smalltalk-80 has no type declarations, J and K may
hold values of any type, from booleans to B-trees. Thus, every time a Smalltalk-80 system
evaluates “J + K”, it must first check the types and then perform the appropriate operation.
Measurements of conventional Smalltalk-80 systems show that over 90% of the “+”
operations do the simplest possible operation, integer addition [Bla83c]. Since a type check takes
at least as long as an add instruction, most Smalltalk-80 systems waste a lot of time checking
types for integer arithmetic.

3.2.1.  Tags  Trap  Bad  Guesses 

The purpose of data tags in SOAR is to improve performance, not to discover program
errors as in the R-2 and B6700. SOAR's instruction set follows other Smalltalk-80
implementations in having only two types of tagged data: integers and pointers [GoR83]. In
SOAR, the high-order bit of each word distinguishes these two types. For arithmetic and
comparison operations, SOAR assumes that the operands are integers and begins the
operation immediately, simultaneously checking the tags to confirm the guess. Most often
(>92%, Table A.4) both operands are integers and the correct result is available after one
cycle. If not, SOAR aborts the operation and traps to routines that carry out the appropriate
computation for the data types. Figure 3.1 shows the SOAR tags. This feature is very
important; without it, SOAR would run 26% slower and require 15% more memory (Tables
A.7 and A.8). SOAR is the only Smalltalk-80 system that overlaps these operations. Every
other Smalltalk-80 system incurs a time penalty for serial tag checking. It would be very
difficult for an optimizing compiler to eliminate these checks in the absence of type
declarations.
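
The guess-then-verify behavior described above can be modeled in software, although
software can only imitate sequentially what the hardware does in parallel. The sketch
below is an illustration under assumptions, not the SOAR data path: it assumes that a
high-order bit of 0 marks an integer and 1 a pointer, it ignores integer overflow, and
the names `tagged_add` and `trap_handler` are invented for the example.

```python
TAG_BIT = 0x80000000    # assumed encoding: 0 = integer, 1 = pointer
DATA_MASK = 0x7FFFFFFF  # the 31 data bits of an integer word


def is_integer(word):
    """True when the high-order tag bit marks the word as an integer."""
    return word & TAG_BIT == 0


def tagged_add(a, b, trap_handler):
    """Add two tagged words, trapping to software when the integer guess is wrong."""
    # The sum is formed unconditionally, as if both operands were integers.
    # Overflow handling is deliberately omitted from this sketch.
    speculative_sum = (a + b) & DATA_MASK
    if is_integer(a) and is_integer(b):
        return speculative_sum       # common case: the guess was right
    return trap_handler(a, b)        # rare case: discard the sum and trap
```

In the common case the tag check costs nothing extra, because the sum is already
available; only the rare non-integer case pays for the software routine.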

3.2.2. Conditional Skip Instructions

Although  condition  codes  have  been  widely  used  to  decouple  a  test  from  a  branch,  they 
are  awkward  for  a  Smalltalk  system.  Instead  of  condition  codes,  SOAR  has 
compare-and-skip  instructions  that  quickly  perform  integer  comparisons.  Remember  that 
Smalltalk  has  dynamic  type  binding.  Thus,  in  SOAR,  “i  <  j”  must  be  computed  with  an 

Figure 3.1: SOAR tagged data types. SOAR supports two data types, 31-bit signed
integers and 28-bit pointers. Pointers include a generation tag (as explained in Section 3.5.1).
SOAR words could have contained 32 bits of data plus one bit of tag for a total of 33 bits.
The scarcity of 33-bit tape drives, disk drives, and memory boards led us to shorten our
words to a total of 32 bits including the tag (31 bits of data).

instruction that checks the tags of i and j as it compares them. If the condition holds, there is
a one cycle penalty for skipping an instruction. If the condition fails, the instruction
following the skip is executed. This is usually a jump. What if one of the operands is not an
integer? A trap to the appropriate comparison software will be taken. In a condition code
architecture, this software (e.g. the floating point compare routine) would have to set the
condition codes to reflect the result. In SOAR, all it must do is return to the next instruction
or the one after that, a simpler and faster operation.

Separating a conditional jump into a conditional skip and unconditional jump does not
impose a significant performance penalty. SOAR jump instructions contain the absolute
address of the target instruction. Because no address computation is required, SOAR
eliminates the instruction prefetch penalty for jumps (see Fast Shuffle in Section 3.4). Thus, a
conditional branch can be simulated in two cycles, one for the skip and one for the jump.
The only way to speed up conditional branches would be to add a one cycle
compare-and-branch instruction to SOAR. Such an instruction would require the addition
of a separate adder to compute the branch target address in parallel with the comparison
operation. Worse, it would only speed up SOAR by 3%, which would not justify the
additional hardware. (See Section A.2.2.)
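
The control flow of a compare-and-skip instruction, including the trap case, can be
sketched as a function that returns the address of the next instruction to execute. The
model below rests on assumptions: addresses count whole instructions, a high-order bit of
0 marks an integer, integers are 31-bit two's complement values, and the names `skip_lt`
and `compare_trap` are hypothetical.

```python
TAG_BIT = 0x80000000    # assumed encoding: 0 = integer, 1 = pointer
DATA_MASK = 0x7FFFFFFF
SIGN_BIT = 0x40000000   # sign bit of a 31-bit two's complement integer


def is_integer(word):
    return word & TAG_BIT == 0


def to_signed(word):
    """Interpret the 31 data bits of an integer word as a signed value."""
    value = word & DATA_MASK
    return value - 0x80000000 if value & SIGN_BIT else value


def skip_lt(i, j, pc, compare_trap):
    """Next pc after a 'skip if i < j' instruction located at pc."""
    if is_integer(i) and is_integer(j):
        holds = to_signed(i) < to_signed(j)
    else:
        # Software decides the comparison; it reports the outcome simply by
        # choosing which of the two following instructions to resume at.
        holds = compare_trap(i, j)
    # Condition holds: skip the next instruction (usually a jump).
    return pc + 2 if holds else pc + 1
```

The trap routine needs no condition codes: returning to `pc + 1` or `pc + 2` is the
whole result, which is the simplification the text describes.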

3.2.3. Two-Tone Instructions

A tagged architecture that lacks microcode must include instructions that manipulate
and inspect tags. Because the Smalltalk system already relies on the compiler to ensure
system integrity, we can allow the compiler to mix instructions that manipulate tags with
instructions that are constrained by tags. Each SOAR instruction contains a bit that either
enables or disables tag checking. Untagged mode (indicated by a % in the assembly
language) turns off all tag checking and operates on raw 32-bit data. In untagged mode the
tag bits are treated as data, and the complete instruction set can be used to manipulate this
data. Untagged instructions also allow programs written in conventional languages such as
C and Pascal to run on SOAR. Instead of providing two versions of each instruction, we
could have defined a mode bit in the PSW. This would have been very expensive, increasing
execution time by 16% and memory usage by 19% (Tables A.11 and A.12).

3.2.4. Tagged Immediate Operands

SOAR’s immediate format has been designed to accommodate tagged data. The
high-order four bits of the 12-bit field become the tag bits of the operand, the low-order
seven bits of the immediate field form the low-order seven bits of the operand, and the eighth
bit is sign-extended to fill in the bits in the middle (see Figure 3.2). Thus, any tagged value
between -128 and 127 can be represented as shown in Table 3.1. This saves time by
allowing the Smalltalk-80 software to encode some important tagged values as immediate
operands. Of course, there is no such thing as a free lunch. Reserving four tag bits severely
curtails the range of addresses and offsets, from -2048..2047 to -128..127. However, this
representation optimizes the more frequent case and improves performance by 10% (Table
A.15).
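
The expansion rule for immediates described above can be written out directly. The
sketch follows the three steps in the text (tag bits to the top, low seven bits kept,
eighth bit replicated through the middle); the function name `expand_immediate` is an
assumption, and the sample values are the hexadecimal entries that appear in Table 3.1.

```python
def expand_immediate(imm12):
    """Expand a 12-bit tagged immediate field into a 32-bit operand."""
    if not 0 <= imm12 <= 0xFFF:
        raise ValueError("immediate field must fit in 12 bits")
    tag = imm12 >> 8              # high-order four bits become the operand's tag bits
    low7 = imm12 & 0x7F           # low-order seven bits are copied unchanged
    sign = (imm12 >> 7) & 1       # the eighth bit is the sign
    middle = 0x0FFFFF80 if sign else 0   # sign replicated through bits 7..27
    return (tag << 28) | middle | low7


for field in (0xFFF, 0x07F, 0xB00, 0xB7F):
    print(f"{field:03X} -> {expand_immediate(field):08X}")
```

Only eight of the twelve bits carry value information, which is why the representable
range shrinks from twelve signed bits to eight.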


Table 3.1: Useful immediate values.


Immediate  Field  '  Ex 


from  to 


- bit  Integers 


FFF    FFFFFF80  FFFFFFFF
07F    00000000  0000007F


31 -bit  Integers 


7FFFFFFF 

0000007F 


Pointers  to  Frequently  Referenced  Objects 
(includes  nil,  true,  and  false) 


B00  B7F    B0000000  B000007F


Values  for  Testing  Tags  of  Pointers 


-128 

-1 

0 

127 

assistant generation
associate  generation 
full  generation 
emeritus  generation 
activation  record 


Smalltalk can be transported to a new machine by writing only the virtual machine
emulator.

This  approach  has  drawbacks  too: 

•  Decoding such dense instructions takes either substantial hardware or substantial time.
For example, the Dorado Instruction Fetch Unit consumes 20% of the CPU [Pie83], and in
Berkeley Smalltalk, decoding a simple bytecode takes twice as long as executing it.

•  Some of the high-level instructions require many microcycles to execute. These
multicycle instructions must be sequenced by a dedicated control unit.

3.3.1.  Reduced  Instruction  Set 

Following the reduced instruction set approach, we abandoned the Smalltalk virtual
machine instruction set and designed the SOAR instruction set from scratch to minimize the
time and hardware needed to decode and execute instructions. SOAR instructions therefore
resemble microinstructions. Although such an instruction set results in larger object code,
we believe that the cost of 500 KB of additional main memory is offset by an approximate
doubling in speed.

Each SOAR instruction occupies a 32-bit word, and most instructions take one cycle.
The only exceptions are loads, stores, and returns, which take two cycles. The uniform
length and duration of instructions simplify instruction prefetch. Figure 3.3 shows
instruction formats.

SOAR departs from RISC II by omitting byte-addressing. Instead, separate
instructions insert or extract bytes from words. Unlike systems for other languages such as C,
Smalltalk-80 systems do not support scalar data types that occupy a single byte. (The
system software uses bytes to pack fields into the object header.) Processors with
byte-addressing incur a time penalty due to the alignment logic. Even if no penalty
occurred, adding byte addressing would only improve performance by 7% (Table A.17). On


If the object is a tagged integer, its type must be supplied by a trap handler. Dedicating an
opcode to this function saves time in the trap handler. Likewise the sll instruction allows a
tag trap to be treated differently according to whether addition or shifting was intended.
Neither of these cloned instructions is very important. The loadc instruction realizes only a
0.5% performance improvement (Table A.18). We believe that the sll instruction would not
improve performance much either. Since the compiler used for these studies did not go to
the trouble to generate it, we could not measure the frequency of this instruction.

Two glaring omissions from SOAR are a barrel shifter for single-cycle, multiple-bit
shifts and support for integer multiplication and division. Although multiple-bit shifts may
be important for driving the bitmapped display, they would speed up normal Smalltalk-80
programs by less than 0.4% (Table A.19). Likewise, instantaneous multiplication and
division would shave only 3% off of our benchmark times (Table A.20).

One drawback of SOAR's reduced instruction set is the increased time for compilation.
Bush has written a converter in Smalltalk that translates bytecodes to SOAR instructions
[Bus85]. He reports that, running on a Dorado, the mean time to convert a subroutine is 50
ms, and that “Subjectively, the converter does not intrude on interactive system use.”
The extra time needed to compile to SOAR instructions does not seem to pose a problem.

More significantly, SOAR’s simple instruction set enlarges compiled code.
Experience with Hilfinger’s Slapdash SOAR compiler suggests that on the average, one bytecode
results in one 32-bit SOAR instruction. Thus, ignoring data objects, object headers, and
literal data within subroutines, there is a fourfold code expansion. However, bytecodes
constitute only about one eighth of a 32-bit Smalltalk-80 image, and the net increase is only 0.5
MB over the original 1 MB. This is not an exorbitant price to pay given current memory
technology.


Other compiled Smalltalk-80 systems also pay this price. The Xerox 68010 system
devotes 0.25 MB to a cache of compiled code [DeS84]. Deutsch reports that one bytecode
results in six bytes of MC68010 instructions, which is worse than the factor of 4 for SOAR
[Deu85]. This means that if it were to compile all of the code, as the SOAR system does,
the Xerox 68010 system would need 0.7 MB (Table 3.3).
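The arithmetic behind these space figures can be checked with a short sketch. It assumes, as the text suggests, that the compiled code is counted on top of the original 1 MB image and that bytecodes occupy one eighth of that image. The function and constant names are illustrative only.

```python
IMAGE_MB = 1.0             # size of the original image (from the text)
BYTECODE_FRACTION = 1 / 8  # share of the image that is bytecodes (from the text)


def compiled_code_mb(expansion_ratio):
    """Megabytes of native code if every bytecode is compiled."""
    return IMAGE_MB * BYTECODE_FRACTION * expansion_ratio


def total_memory_mb(expansion_ratio):
    """Original image plus compiled code (assumption: the image itself is kept)."""
    return IMAGE_MB + compiled_code_mb(expansion_ratio)


soar_code = compiled_code_mb(4)     # one 4-byte SOAR instruction per bytecode
m68010_code = compiled_code_mb(6)   # six bytes of MC68010 code per bytecode
```

Under these assumptions the SOAR code comes to 0.5 MB and a fully compiling 68010 system to 0.75 MB, which the text quotes as 0.7 MB.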

Finally, our decision to abandon bytecodes will force us to rewrite the Smalltalk-80
debugger. Lee has designed a debugger for SOAR and has built a prototype in Berkeley
Smalltalk [Lee84]. He exploited the hardware organization of SOAR in the design of the
debugger to add a conditional breakpoint facility and increase execution speed during
debugging.


3J1  SOAR  Interrupts  and  Traps 

Interrupts and traps play a larger role in SOAR than in RISC II. Unlike C, Smalltalk
grew in an environment with extensive, system-specific microcode. Since SOAR has no
microcode, unusual situations must be met with a trap to a software handler. For example,
as described above, other Smalltalk implementations check the types of arithmetic operands
sequentially, before performing the operation. SOAR checks in parallel, trapping if the
operands are not simple integers. These account for about half of the traps (Table A.25).
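The difference between sequential and parallel checking can be sketched in software. The word layout below (high-order bit clear for an integer, set for a pointer) is an assumption made for illustration, as are all of the names. Overflow is ignored, and the hardware performs the check alongside the addition rather than after it.

```python
TAG_BIT = 1 << 31   # assumed encoding: high bit clear = integer, set = pointer


class TagTrap(Exception):
    """Stands in for the hardware trap to a software handler."""


def is_integer(word):
    return (word & TAG_BIT) == 0


def sequential_add(a, b):
    # Conventional scheme: test each operand in turn, then add.
    if not is_integer(a):
        raise TagTrap("first operand is not an integer")
    if not is_integer(b):
        raise TagTrap("second operand is not an integer")
    return a + b


def parallel_style_add(a, b):
    # SOAR-like scheme: the sum is always computed; the tag check only
    # decides whether the result is kept or a trap is taken instead.
    result = a + b
    if (a | b) & TAG_BIT:
        raise TagTrap("an operand is not an integer")
    return result
```

In the common case both versions return the same sum. The point of the second form is that the check adds no extra step before the addition.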

How valuable are conditional trap instructions? They save time and space by replacing
a two-cycle, two-instruction sequence with one single-cycle instruction. For instance, the
prologue in each subroutine uses a conditional trap instruction that verifies the type of its
first argument. This saves a cycle over a skip and branch in the common case. Trap instructions
also support type checking in low-level primitive routines, and tag checking for
automatic storage reclamation. However, if the trap instruction traps, it takes more time to
handle the trap than the jump from a skip-and-jump sequence. In fact, trap instructions
account for 10% of the traps (Table A.25). Despite all these uses, the savings from trap
instructions do not add up to much; SOAR would run only 4% slower and require only 2%
more memory without them (Tables A.23 and A.24). The fact that trap instructions save little
time results more from the low frequency of trap instructions than from the penalty associated
with taking the traps.


Table 3.3: Space Penalty of Compilation.

System              execution model          code expansion ratio   memory required*
Berkeley Smalltalk  bytecode interpreter     1                      1.0 MB
Xerox 68010         cache of compiled code   6                      1.3 MB
SOAR                compiles everything      4                      1.5 MB
hypothetical 68010  compiles everything      6                      1.7 MB

The  remaining  source  of  traps  also  arises  in  RISC  II.  A  call  or  return  that  exceeds  the 
on-chip  register  window  capacity  must  trap  to  a  routine  to  save  or  restore  a  set  of  registers. 
This  accounts  for  the  remaining  40%  of  the  traps  (Table  A.25). 

To reduce the cost of trapping, SOAR exploits shadow registers that catch the
operands of the trapping instruction. These are inexpensive in single-chip processors; they
are just two more registers on the data busses near the ALU. This feature is insignificant;
without it, SOAR would run only 0.04% slower and require no more memory (Table A.26).
Other features that simplify trap handling include simple instructions and uniform instruction
size.

SOAR does not support nested interrupts or traps because they complicate the architecture.
The interrupt-enable bit in the PSW (Figure 3.4) is reset upon an interrupt or trap.
Each trap handler first captures any necessary machine state, then re-enables interrupts.
Most handlers need their own register window to hold this state. The normal method to
obtain a new register window would be to execute a call instruction but, since a call can
cause a trap (see above), the trap handler must simulate the call (and trap). After getting a
new window and saving the machine state, the handler can re-enable interrupts (and optionally
surrender its register window) with a form of the return instruction.


Figure 3.4: SOAR Program Status Word. The SOAR program status word contains a destination
register shadow field, an opcode shadow field, and enable bits for external and
software interrupts.

When an interrupt or trap occurs, the instruction that is executing is aborted before it
can change any registers. The address of the aborted instruction is saved in r7. I/O interrupts
are disabled by clearing the interrupt enable bit in the PSW. This freezes the shadow
registers, which normally track the ALU inputs. A vector is constructed from the trap base
register, the opcode of the aborted instruction, and the reason for the trap. Finally, control is
transferred to the vectored location. Table 3.4 lists the various categories of traps, with
interrupt priority listed from highest to lowest.

Table 3.4: SOAR traps and interrupts.

Many instructions can trap for several reasons at once. To simplify the interface to the
trap handler code, the reasons are prioritized. After handling a trap, the offending instruction
is typically reexecuted to spring any remaining traps. Table 3.5 shows which reasons
apply to which instructions. If, instead of vectoring, SOAR put the reason for the trap in a
special register, the system would be only 3% slower (Table A.28).
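The handle-then-reexecute discipline can be sketched as a loop. The reason names used in the usage note are hypothetical, since the actual categories are those of Table 3.4. The sketch also assumes that each handler fully repairs the condition it was invoked for.

```python
def run_with_traps(pending_reasons, priority_order):
    """Return the order in which trap handlers run for one instruction.

    pending_reasons: reasons for which the instruction would trap.
    priority_order: every possible reason, from highest to lowest priority.
    """
    pending = set(pending_reasons)
    handled = []
    while pending:
        # Only the highest-priority pending reason is reported.
        reason = next(r for r in priority_order if r in pending)
        handled.append(reason)    # the handler repairs this condition
        pending.discard(reason)   # reexecution no longer trips on it
    return handled                # the instruction can now complete
```

For example, `run_with_traps({"tag", "window"}, ["window", "tag", "other"])` runs the "window" handler first and the "tag" handler on the reexecution.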

When SOAR does trap, it expends two extra cycles to flush the pipeline. A one-cycle
trap, while feasible, would have significantly degraded the cycle time [Pen85b]. Since the
extra trap cycle increased the number of cycles by less than one percent, the net result was a
faster system.

3.4.  Fast  Calls 

The Smalltalk-80 system stresses program modularity, but omits macros because they
would make it harder to incorporate changes quickly. If the user changed a macro, the system
would have to recompile all of the modules that instantiated it. This would make it
more difficult to maintain the split-second response time that is crucial to highly productive
programming. Instead, Smalltalk-80 programs are broken up into many small subroutines.
Consequently, Smalltalk-80 systems execute a higher percentage of call instructions than
most other systems. In addition to being frequent, calls are also expensive because:

• To aid program debugging, Smalltalk-80 initializes all local variables on each call.

• A consequence of Smalltalk-80's power is that the destination of a call is recomputed
from the type of the first argument with a table lookup each time the call is executed.


Table 3.5: Trap reasons by instruction category.

The result is that many Smalltalk implementations (including Berkeley Smalltalk and
Dorado Smalltalk) spend about half of their time on calls and returns [Deu81]. SOAR
reduces the Smalltalk call/return overhead in several ways.

3.4.1.  Multiple  Overlapping  On-Chip  Register  Windows 

SOAR, like RISC I, optimizes subroutine calls and returns by providing a large,
on-chip register file. The registers are divided up into overlapping windows. Instead of saving
or restoring registers, calls or returns merely switch windows (Figure 3.5). Compared to
C language subroutines, the shorter Smalltalk subroutines pass fewer operands and use fewer
local variables, and so need fewer registers. For this reason, each SOAR register window
has eight registers instead of 12 for RISC I. Figures 3.6 and 3.7 show the register organization
of SOAR. In addition to 56 more registers, the inclusion of register windows results in
the addition of a register to select the current window (the Current Window Pointer, or cwp),
a register to detect overflows by recording the last saved window (Saved Window Pointer, or
swp), more elaborate register decoders, and trapping logic [Pen85b]. Despite the cost of all
the added hardware, Smalltalk-80's predilection for procedure calls makes this feature very
important. The cost of saving and restoring a conventional register file would slow the
machine down by 46%, even with load- and store-multiple instructions (Table A.29).


Figure 3.6: SOAR's register windows. Like RISC I, SOAR has many physical sets of registers
that map to the logical registers seen by each subroutine.


Figure 3.7: Logical view of register file. The HIGHs hold incoming parameters and local
variables. The LOWs are for outgoing arguments. The SPECIALs include the PSW and a
register that always contains zero. The GLOBALs are for system software such as trap
handlers.
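The mapping from logical to physical registers can be sketched as follows. Several details are assumptions made for illustration rather than facts stated above: eight on-chip sets (suggested by the three-bit cwp), logical r0 to r7 as the HIGHs and r8 to r15 as the LOWs, and a call that decrements the cwp (the text only says that a return increments it). Overflow and underflow traps are not modelled.

```python
NUM_SETS = 8   # assumed: eight on-chip sets of eight registers
SET_SIZE = 8


class WindowedFile:
    def __init__(self):
        self.sets = [[None] * SET_SIZE for _ in range(NUM_SETS)]
        self.cwp = 0   # index of the set serving as the HIGH window

    def _locate(self, reg):
        # Assumed numbering: r0-r7 are the HIGHs, r8-r15 the LOWs.
        if 0 <= reg < 8:
            return self.sets[self.cwp], reg
        if 8 <= reg < 16:
            return self.sets[(self.cwp - 1) % NUM_SETS], reg - 8
        raise ValueError("only windowed registers are modelled")

    def read(self, reg):
        regs, i = self._locate(reg)
        return regs[i]

    def write(self, reg, value):
        regs, i = self._locate(reg)
        regs[i] = value

    def call(self):
        # The caller's LOWs become the callee's HIGHs; nothing is copied.
        self.cwp = (self.cwp - 1) % NUM_SETS

    def ret(self):
        # A return switches back to the caller's window.
        self.cwp = (self.cwp + 1) % NUM_SETS
```

A value that the caller writes to r14 is visible to the callee as r6 after `call()`, and the caller sees its own registers again after `ret()`.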


When the number of activations on the stack exceeds the on-chip register capacity,
SOAR traps to a software routine that saves the contents of a set of registers in memory.
Unlike RISC II, SOAR has load- and store-multiple instructions to speed register saving and
restoring. These instructions can transfer eight registers in nine cycles (one instruction fetch
and eight data accesses). Without them, the system would need eight individual instructions
that would consume sixteen cycles (eight instruction fetches plus eight data accesses).
Load- and store-multiple are also helpful for garbage collection, copying data, and operations
on bit-mapped images. These instructions have the ability to operate on
non-contiguous data; the increment between memory references is given by the SOURCE2
field. In retrospect, these multi-cycle instructions added some complexity to the design, and
the benefits (3% of execution time and 2% of memory) may not be worth the costs
(Tables A.33 and A.34).
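The cycle counts quoted above follow from a simple cost model, sketched here with illustrative function names: every instruction costs one fetch, and every register moved costs one data access.

```python
def cycles_with_multiple(n_registers):
    # One instruction fetch, then one data access per register.
    return 1 + n_registers


def cycles_with_singles(n_registers):
    # Each individual load or store needs its own fetch and data access.
    return 2 * n_registers
```

For a window of eight registers this gives nine cycles against sixteen, matching the figures in the text.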

3.4.2. Caching Call Targets In Line

Another way SOAR reduces subroutine overhead is by decreasing the time taken to
find the target of a call. Once computed, the target's address is cached in the instruction
stream for subsequent use, as suggested by Schiffman and Deutsch [DeS84]. Figures 3.8
and 3.9 illustrate this idea. This in-line caching exacts a price for its time savings; SOAR
must support non-reentrant code. Since all Smalltalk processes share the same address space,
process switches must be avoided in sections of code that modify or use the cached data.
One approach would be to implement semaphores in software. This would be too expensive
because each Smalltalk call executes a short non-reentrant section of code. The approach we
followed was to add a bit to each instruction to disable process switches.

In Smalltalk, calls and jumps are so frequent that the virtual machine can defer a process
switch until executing the next call or jump instruction. The SOAR call and jump
instructions include a bit to specify when it is safe to switch processes [Deu82b]. This bit
enables a software interrupt. When the operating system desires a process switch, it sets a
bit in the Program Status Word requesting the software interrupt and resumes execution of
the same process. The next time a safe jump or call is executed, the software interrupt
transfers control to the operating system, which can then safely suspend the process.
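The deferred process switch can be modelled in a few lines. The class and method names are hypothetical, and the model reduces the Program Status Word to the single request bit that matters here.

```python
class Processor:
    def __init__(self):
        self.switch_requested = False   # models the request bit in the PSW
        self.switches = 0

    def request_process_switch(self):
        # The operating system sets the bit and resumes the same process.
        self.switch_requested = True

    def execute(self, opcode, safe=False):
        # Only a call or jump marked safe can take the software interrupt.
        if opcode in ("call", "jump") and safe and self.switch_requested:
            self.switch_requested = False
            self.switches += 1          # control passes to the operating system
            return "software interrupt"
        return "executed"
```

Instructions inside a non-reentrant section are issued with `safe=False`, so a pending request simply waits for the next safe call or jump.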

Although complicated, in-line caching pays handsome rewards. The conventional way
to cache call targets is a hash table. But the overhead for probing into a hash table would
slow SOAR by 33% (Table A.37). The hardware penalty for in-line caching is the software
trap mechanism. If we were forced to omit this, we could use an indirect in-line cache. The
information could be cached in a per-process data area instead of the call instruction. This
would  slow  SOAR  down  by  7%  (Table  AJ37).  Even  with  in-line  caching,  SOAR  still  spends 

Figure 3.8: Caching the target address in the instruction stream. In this example, the print
routine is called with an argument that is a string. (The argument is passed in r6.) The first
time the call instruction is executed, the call contains the address of a lookup routine and the
word after the call contains a pointer to the name “print.” The lookup routine follows the
pointers to the entry table for strings, and finds the entry for “print.” It then overwrites the
call instruction with a call to that routine and replaces the word after the call with the type
of the argument (string).
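The behaviour described in the caption can be sketched as a call site that remembers the last target it resolved. This is a software analogy only. The names are hypothetical, Python classes stand in for Smalltalk types, and the type check is done at the call site here for brevity.

```python
class CallSite:
    def __init__(self, selector):
        self.selector = selector   # name handed to the lookup routine
        self.cached_type = None    # the word after the call, once filled in
        self.target = None         # address field of the call, once filled in
        self.lookups = 0

    def call(self, receiver, entry_tables):
        receiver_type = type(receiver)
        if self.cached_type is not receiver_type:
            # Miss: do the full table lookup, then overwrite the call site.
            self.lookups += 1
            self.target = entry_tables[receiver_type][self.selector]
            self.cached_type = receiver_type
        return self.target(receiver)
```

Repeated calls with arguments of the same type reuse the cached target. Only a change of type forces another lookup.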


off-chip latch to store the incoming instruction and send it back to memory. Figure 3.10
illustrates the Fast Shuffle logic. Though not spectacular, its performance impact is
significant; SOAR would use 11% more cycles without the Fast Shuffle.

Pendleton has uncovered a serious flaw in our realization of the Fast Shuffle [Pen85a].
When a jump or call instruction follows a skip, the skip condition must be evaluated before
the chip can signal a Fast Shuffle to the memory system. If the condition holds, the memory
system must use the PC as the address of the next instruction; if the condition fails, the
memory system must use the target field from the jump or call instruction. In designing the
instruction set, we encoded the condition field (of skip and trap) so tightly that a PLA was
required to decode the condition and the output of the ALU. This PLA adds 110 ns to the
time needed to compute the Fast Shuffle control signal during a skip instruction. Although
the NMOS SOAR chips can execute an instruction in 400 ns, the memory system cannot
start the next instruction fetch for another 110 ns, reducing the effective cycle time to about
510 ns. This overhead could be eliminated by forgoing the Fast Shuffle and using delayed
branches and calls. Alternatively, the instruction set could be redesigned with a condition
field that could be decoded more quickly. This problem would have been found much earlier
if we had simulated the whole system instead of the processor.

3.4.4.  The  Return  Instruction:  Parallel  Register  Initialization 

The other half of the team is the return instruction. In SOAR, the return instruction
performs one compulsory and three optional functions, specified by the low-order three
opcode bits. The compulsory function is a transfer of control, which means that the
bare-bones return instruction can be used as an indirect jump. If tag checking is enabled, the
tag of the return address is checked. This provides a means to intercept returns when the
activation record must be saved. The first optional function enables interrupts and yields a
“return from interrupt” instruction. The second optional function increments the cwp
(changing register windows) for returning from a normal call.

Figure 3.10: Fast Shuffle logic. When a call or jump is fetched from memory, the next
instruction is prefetched based on the external address latch instead of the PC.

The Smalltalk-80 language requires local variables to be initialized to nil, so the last
optional function for SOAR’s return instruction prepares registers 8 through 13 for a future
call by writing nil into them. Instead of commencing each subroutine with an instruction
sequence to write nil into each register that will contain a local variable, SOAR exploits
VLSI circuitry to initialize the registers in parallel. Although it would be more straightforward
for the call instruction to perform this initialization, this would slow down the call.
Instead, we have placed this functionality in the return instruction. Since the return instruction
must wait an extra cycle to fetch its target instruction, the “nilling” does not slow the
instruction down. This feature eliminates the extra time required to initialize the registers
after every call. Ironically, Smalltalk-80 subroutines use so few temporary variables (fewer
than one on the average) that this feature has little favorable impact. The system would
only run 4.3% slower and use 1% more memory without it.

3.5. Object-Oriented Storage Management

Smalltalk-80 data structures are called objects. SOAR objects average 14 words in
length and live for about 500 instructions. Smalltalk-80 objects are smaller and more volatile
than data structures in most other exploratory programming environments. Smalltalk-80
systems face three challenges in managing storage for objects:

• Automatic storage reclamation: on average, 12 words of data are freed and must be
reclaimed per 100 Smalltalk-80 virtual machine bytecodes executed.

• Virtual memory: all objects must be in the same address space.

• Object-relative addressing: although offsets into objects are known at compile-time,
base addresses are not. Code must be compiled to address fields relative to dynamically
determined base addresses.


3.5.1.  Automatic  Storage  Reclamation 


SOAR supports Generation Scavenging to reclaim storage efficiently without requiring
costly indirection or reference counting (see Section 5.8). This algorithm is based on the
observation that most objects either die young or live forever. Thus, objects are placed into
two generations and only new objects are reclaimed. A better method of storage reclamation
has a strong impact on performance; most other algorithms would squander 10% to 15% of
SOAR’s time on automatic storage reclamation instead of Generation Scavenging’s 3%
(see Chapter 5). Hence, without Generation Scavenging SOAR would take 4% to 15% more
cycles to run the benchmarks.

Traditional software and microcode implementations of object-oriented systems rely
on an object address table (Figure 3.11). Each field of an object contains an index into this
table, and the table entry contains the address of the object. The level of indirection supplied
by the table provides support for compaction. As explained in Chapter 5, Generation
Scavenging provides compaction for free, permitting SOAR to function without an object
table (Figure 3.12). Without this algorithm, the extra work to follow the indirect pointers
through the object table would slow SOAR down by 20% (Section 5.9.4).


Figure 3.11: Indirect addressing. In traditional Smalltalk-80 systems, each pointer is really
a table index. The table entry contains the target’s reference count and memory address.
This indirection required previous Smalltalk-80 systems to dedicate base registers to frequently
accessed objects. The overhead to update these registers slowed each procedure
call and return.

Figure 3.12: Direct addressing. A SOAR pointer contains the virtual address of the target
object. This is the fastest way to follow pointers.


Generation Scavenging requires that a list be updated whenever a pointer to a new
object is stored in an old object. When designing SOAR, we thought that stores would be
frequent enough to warrant hardware support for this check. Thus SOAR tags each pointer
with the generation of the object that it points to. While computing the memory address, the
store instruction compares the generation tag of the data being stored with the generation tag
of the memory address (Figure 3.13). For 96% of the stores, list update is unnecessary and
the store completes without trapping (Table A.52). Once again we rely on tags to confirm
the normal case and trap in the unusual case. Surprisingly, tagged stores are so infrequent
that hardware support saves only 1% of the time and 3% of memory over an explicit check
(Tables A.49 and A.51). This feature does not seem to be worth the effort.


Figure 3.13: Generation tag checking in parallel with a store operation. The first check
(1111) is for contexts and is explained in Section 3.3.2.
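The store check can be written out as the explicit software test that the hardware support replaces. The two-valued generation encoding and all names are assumptions for illustration. The list of remembered objects stands for the list that Generation Scavenging requires to be kept up to date.

```python
NEW, OLD = 0, 1   # assumed two-generation encoding, for illustration only


class Obj:
    def __init__(self, generation):
        self.generation = generation
        self.fields = {}


remembered = []   # old objects that may hold pointers to new objects


def store(target, field, value):
    # The comparison that SOAR performs in parallel with the store.
    if (isinstance(value, Obj) and value.generation == NEW
            and target.generation == OLD and target not in remembered):
        remembered.append(target)   # the rare case, where SOAR would trap
    target.fields[field] = value
```

Most stores fall through the test without touching the list, which corresponds to the 96% of stores that complete without trapping.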


3.5.2. Activation Records as Objects

Smalltalk-80 activation records pose a special problem. Since each call needs a new
activation record, they must be easy to create. Because local variables reside in them, at
least the current activation record must be easy to access. For these reasons,
high-performance systems for other languages allocate activation records on a stack, and
keep the active activation record in registers. The problem for Smalltalk-80 systems arises
because the language specifies that the format and lifetime of an activation record shall be
the same as any other object. In other words, a Smalltalk-80 activation record must be
stored in memory with a standard object header. Worse, an activation record cannot be deallocated
until the last reference to it is destroyed, even after control returns from it.

SOAR caches activation records in an on-chip register file for speed, backed with an
overflow stack in memory. Pointers to activation records are rare, so SOAR’s hardware
merely detects these and causes a trap at the appropriate time. The first trap occurs when a
reference to an activation record is created. Pointers to activation records have all the tag
bits set. When such a word is stored into memory, the tag check causes a trap. At the time
of the trap, the high order bit of the activation record’s return address is set. Setting this bit
indicates that the activation record may outlive its parent. Since these records are normally
allocated and freed last-in-first-out (LIFO), we label such anomalously long-lived activation
records as non-LIFO. The return instruction then traps if the return address has the high
order bit set; this lets software save this activation record in the heap.
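The two traps cooperate as sketched below. The position of the flag bit in a 32-bit return address and the names used are assumptions. The list called heap stands for wherever software saves records that outlive their callers.

```python
RETURN_FLAG = 1 << 31   # assumed position of the high order bit


class Activation:
    def __init__(self, return_address):
        self.return_address = return_address


heap = []   # activation records saved by software


def on_pointer_stored(activation):
    # First trap: a reference to the record is being stored into memory.
    activation.return_address |= RETURN_FLAG


def do_return(activation):
    # The return traps if the flag is set, so software can save the record.
    if activation.return_address & RETURN_FLAG:
        heap.append(activation)
    return activation.return_address & ~RETURN_FLAG
```

An ordinary record returns without any extra work. Only a record that has been referenced pays for the trap and the copy to the heap.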

What if a program references an activation record while it is still on the stack? First,
SOAR leaves small gaps between activation records when they are stored in main memory.
These gaps are initialized with object headers to permit the stored activation records to
behave as objects. Second, SOAR's hardware provides pointer-to-register addressing. Each
load and store checks if the target address resides in the on-chip register file. If so, the chip
substitutes a register access for a memory access. This mechanism makes it possible to
access on-chip activation records as if they were in memory.
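Pointer-to-register addressing amounts to an address-range check on every load and store. The sketch below assumes that the on-chip registers shadow a contiguous run of word addresses starting at a hypothetical window_base. The real mapping depends on the saved window pointer and is not reproduced here.

```python
class Machine:
    def __init__(self, window_base, n_registers):
        # Assumption: the registers shadow the word addresses
        # window_base .. window_base + n_registers - 1.
        self.window_base = window_base
        self.registers = [0] * n_registers
        self.memory = {}

    def _register_index(self, address):
        offset = address - self.window_base
        return offset if 0 <= offset < len(self.registers) else None

    def load(self, address):
        i = self._register_index(address)
        if i is not None:
            return self.registers[i]   # register access replaces memory access
        return self.memory.get(address, 0)

    def store(self, address, value):
        i = self._register_index(address)
        if i is not None:
            self.registers[i] = value
        else:
            self.memory[address] = value
```

Code that holds a pointer into an on-chip activation record therefore reads and writes the registers without knowing that the record is not in memory.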

Since designing SOAR, we have come up with a software solution to the
pointer-to-register problem. This scheme eliminates the comparator and complicated control
logic, incurring only a 3% performance penalty (Table A.53). The key idea is to generate
illegal addresses for the unpredictable but uncommon activation record references, and to
guarantee that the common and predictably referenced activation records reside in memory
when needed (Section A.5.3).

3.5.3. Virtual Memory

The SOAR system will include disk storage and thus supports virtual memory. Section
5.4 explains our choice of demand paging over segmentation. SOAR therefore includes
a pin to request a page fault interrupt. The uniform size and lack of side-effects of SOAR’s
instructions simplify page fault recovery.

3.6.  Implementation 

In this section, we give a brief description of SOAR’s implementation and microarchitecture.
This is covered in more detail in Pendleton's dissertation [Pen85b]. The casual
reader may want to skip this section; those interested in details may want to read on and
learn about the data path required for SOAR’s instruction set. Although simpler than many
other computers, SOAR’s implementation is substantially more complex than its predecessor,
RISC II.

3.6.1.  Special  Registers 

SOAR has eight special-purpose registers that simplify the instruction set and help
with interrupt handling (Tables 3.6 and 3.7). For instance, a register that always contains
zero permits the assembler to synthesize moves with add instructions. Making the program
counter available as a register provides relative addressing without adding another addressing
mode. However, supporting unrestricted use of these registers would complicate SOAR.
Three restrictions apply to these registers:

•  A  result  written  to  a  special  register  does  not  take  effect  until  the  end  of  the  next 
instruction.  The  SOAR  microengine  cannot  forward  special  registers. 

•  A  special  register  cannot  appear  as  the  destination  of  a  load  instruction. 

•  A  special  register  cannot  appear  in  the  SOURCE2  field  of  an  instruction. 


Table 3.6: SOAR special registers.

Name                    Symbol  Reg.  Bits   Contents                      Notes
zero                    rzero   r16   31:0   Always = 0.                   For synthesizing instructions.
program counter         pc      r17   27:0   address of next instruction   For instruction fetching, PC-relative
                                                                           addressing, and case statement indirect
                                                                           jump (ret). Should not be modified
                                                                           directly, but only with jump, call, or
                                                                           ret[inw].
Shadow A                sha     r19   31:0   copy of A input to ALU or     The shadow registers track instructions
                                             shifter                       executed when interrupts are enabled and
                                                                           freeze when interrupts are disabled.
                                                                           Thus, a trap handler can save time by
                                                                           reading operands from the shadow
                                                                           registers instead of decoding the
                                                                           offending instruction.
Shadow B                shb     r18   31:0   copy of B input to ALU or     As for Shadow A.
                                             shifter
Trap Base               tb      r21   31:10  base address of the interrupt
                                             and trap vector area
Saved Window Pointer    swp     r20   27:4   memory address of object      For pointer-to-register logic,
                                             header of the most recently   window-overflow and -underflow trap
                                             saved register window         logic, and computing address of current
                                                                           activation record.
Current Window Pointer  cwp     r22   6:4    index of on-chip register     Cwp controls local register decoders.
                                             set serving as high window
Processor Status Word   psw     r23   15:0   see below

Processor Status Word fields:

Name                       Bits   Contents                                    Notes
shadow destination         4:0    destination register field (bits 22:18)     For trap handlers.
                                  of last instruction executed with
                                  interrupts enabled
software interrupt enable  5      When this bit is on and a call or jump is   For process switching.
                                  executed with bit 29 on, SOAR takes a
                                  software trap.
interrupt enable           6      Enables I/O interrupts and shadow           Disabled in interrupt handlers.
                                  registers.
-                          7      inert                                       Unused.
shadow opcode              15:8   opcode field (bits 30:23) of last           For trap handlers and trap vector logic.
                                  instruction executed with interrupts        CAVEAT: SOAR does not support nested
                                  enabled                                     traps. Traps taken when interrupts are
                                                                              disabled will not vector to proper
                                                                              opcode.

3.6.2. The SOAR Datapath

The SOAR datapath includes a register file, ALU (and byte shifter), the program
counter, memory address register, and saved window pointer. When reading, the busses are
first precharged, then two separate registers may be read onto the busses. For writing, a single
register is addressed, and the data are driven differentially on both busses (Figure 3.14).

3.6.3.  Pipelining  in  SOAR 

The cycle time of SOAR has been matched to memory cycle time. Each instruction is
one word long and most can execute in one cycle. While one instruction executes, the next
is prefetched from memory (Figure 3.15). As described above, jumps and calls require no
address computation and therefore cause no delay in the pipeline. Conditional branches are
synthesized with a skip and an unconditional jump. This takes two cycles, which is the same
as a conditional branch would require.

Figure 3.14: The SOAR datapath. “sha” and “shb” are shadow registers A and B, “byte
ins/ext” is the byte insertion and extraction logic, “dst” is the destination latch, and
“MAL” is the memory address latch.

Figure 3.15: Pipelining in SOAR. Although an instruction takes three cycles, SOAR can
execute one instruction per cycle. Each cycle in turn consists of three phases.


The anatomy of SOAR's cycle is determined by the fact that the datapath allows two
simultaneous precharged reads or one write to the register file. Each cycle is divided into
three nonoverlapping phases. In phase one, SOAR decodes the instruction and precharges
the busses. In phase two, the source registers are read onto the busses. In phase three, the
ALU combines the two operands. Simultaneously, the result from the previous instruction is
stored back into its destination register. Thus, the result of instruction i is not actually stored
into its destination register until the end of instruction i+1. Forwarding logic hides this
delay; if instruction i+1 attempts to read the destination register of instruction i, the desired
value is forwarded from a latch at the output of the ALU. This has a significant effect on
performance; if instead of forwarding, SOAR stalled the pipeline for a cycle, the benchmarks
would run 15% slower (Table A.54).
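The cost that forwarding avoids can be illustrated with a toy cycle count. The sketch below is not the SOAR simulator. It assumes a hypothetical instruction stream written as (destination, sources) tuples, one cycle per instruction, and one stall cycle whenever an instruction reads the register written by its immediate predecessor and no forwarding latch is available. The function name and register names are illustrative only.

```python
def count_cycles(instructions, forwarding):
    """Count cycles for a straight-line instruction stream.

    Each instruction is (dest_register, (source_registers...)).
    Assumption: one cycle per instruction, plus one stall cycle when an
    instruction reads the destination of the instruction just before it
    and there is no forwarding latch to supply the value.
    """
    cycles = 0
    previous_dest = None
    for dest, sources in instructions:
        cycles += 1
        if not forwarding and previous_dest is not None and previous_dest in sources:
            cycles += 1  # wait until the previous result reaches the register file
        previous_dest = dest
    return cycles


# Hypothetical stream: r3 = r1 + r2; r4 = r3 + r1; r5 = r1 + r2; r6 = r5 + r4
stream = [
    ("r3", ("r1", "r2")),
    ("r4", ("r3", "r1")),
    ("r5", ("r1", "r2")),
    ("r6", ("r5", "r4")),
]
with_forwarding = count_cycles(stream, forwarding=True)
without_forwarding = count_cycles(stream, forwarding=False)
```

In this made-up stream two of the four instructions depend on their predecessor, so the stalled version needs two extra cycles. The 15% figure in the text comes from measured benchmark behavior, not from a model this simple.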

Pendleton has proposed a rearrangement of the pipeline that would shorten SOAR’s
cycle time by 25% [Pen85b]. However, the return instruction would be one cycle longer, for
a total of three cycles per return instruction. What would be the net effect? On the average,
SOAR performs 5.4 returns per 100 cycles (Table A.47). Thus, the effect of lengthening the
return instruction would be to execute 5.4% more cycles. Since the new cycle time would
be 25% shorter, the new time to run the benchmarks would be 1.054 × 75% ≈ 79% of the old time.
(See Section 4.1 for a description of the benchmarks.) Rearranging SOAR's pipeline would
substantially reduce execution time.
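The net-effect estimate above can be reproduced in a few lines. This is only a restatement of the arithmetic in the text; the variable names are ours, and the inputs are the 5.4 returns per 100 cycles, the one extra cycle per return, and the 25% shorter cycle.

```python
returns_per_100_cycles = 5.4   # Table A.47
extra_cycles_per_return = 1    # a return would take three cycles instead of two
cycle_time_ratio = 0.75        # the rearranged pipeline has a 25% shorter cycle

# More cycles are executed, but each cycle is shorter.
cycle_count_ratio = 1 + returns_per_100_cycles * extra_cycles_per_return / 100
run_time_ratio = cycle_count_ratio * cycle_time_ratio

print(f"cycle count: x{cycle_count_ratio:.3f}")
print(f"run time relative to the original: {run_time_ratio:.4f}")
```

The product is about 0.79, that is, roughly a 21% reduction in benchmark time.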

3.6.4. Implementation Statistics

Table 3.8 contains some preliminary data for the NMOS SOAR chip, taken from
[Pen85b]. These chips were fabricated by MOSIS [MOSIS] and performed faster than the
simulators predicted, except for the unforeseen delay for jumps and calls described in
Section 3.4.3. The MOSIS NMOS SOAR chips can execute an instruction every 400 ns, which
must be derated to 510 ns for the jump and call delay. Pendleton has perfected the host
board for SOAR, and has successfully run the entire diagnostic suite on the SOAR chips.
The best SOAR chip tested to date functioned perfectly with the exception of a faulty bit in
one register.


Table 3.8: NMOS SOAR characteristics.

line width                    4µ
size (w/ scribe lines)        width 10.7 mm, height 8.0 mm
power dissipation             ~3 watts
supply voltage                5 volts
transistors                   35,700
clocks                        φ1   90 ns    underlap <10 ns
                              φ2   90 ns    underlap <25 ns
                              φ3   145 ns   underlap 40 ns
processor cycle time          <400 ns
fast shuffle settling time    110 ns
minimum system cycle time     510 ns
actual system cycle time      800 ns
pads
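The timing entries in Table 3.8 appear to be mutually consistent, which the short check below illustrates. It rests on our reading of the table, not on a statement in the text: that the processor cycle is the sum of the three clock phases and their underlaps, with the "<" bounds taken at their limits, and that the minimum system cycle adds the fast shuffle settling time.

```python
# (phase duration, underlap) in ns for the three clock phases of Table 3.8.
# Upper bounds such as "<10 ns" are taken at their limit (an assumption).
phases_ns = {"phi1": (90, 10), "phi2": (90, 25), "phi3": (145, 40)}
fast_shuffle_settling_ns = 110

# Assumed relationship: cycle = sum of phases and underlaps.
processor_cycle_ns = sum(duration + underlap for duration, underlap in phases_ns.values())
minimum_system_cycle_ns = processor_cycle_ns + fast_shuffle_settling_ns

print(processor_cycle_ns, minimum_system_cycle_ns)
```

Under these assumptions the sums match the 400 ns processor cycle and the 510 ns derated figure quoted above.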


3.7.  Summary 

In designing SOAR, we have attempted to find a few good ideas to supplement a basic
RISC for Smalltalk. These are listed in Table 3.9. As a result of including all these features,
SOAR is considerably more complicated than RISC II. The next chapter evaluates our
architecture, and identifies its successes and failures.


Table 3.9: SOAR Architectural Ideas.

Idea                                                   Section   From
31-bit arithmetic (with tag & overflow checking)       2
a tagged/untagged mode bit in each instruction         2
conditional skips                                      2         PDP-8
tagged immediate values                                2
compilation to low level instruction set               3         RISC II
uniform length instructions                            3         RISC II
word-addressing w/ byte-insert and -extract            3         MIPS, PDP-10
instructions tagged as integers                        3
vectored, prioritized interrupts and traps             3
shadow registers                                       3
in-line call target cache                              4         Xerox ST-68K
software trap on jumps and traps                       4
one-cycle calls and jumps (fast shuffle)               4
factored return instruction                            4
parallel register initialization on return             4
load- and store-multiple                               4         IBM-360
multiple overlapping register windows on chip                    RISC II
noncontiguous load- and store-multiple                 4
generation scavenging                                  5
trapping stores of new pointers into old objects       5         BS
trapping stores of activation record pointers          5         BS
trapping returns from referenced activation records    5
pointers to registers                                  5
paged virtual memory                                   5         Atlas, Sun
direct object addressing                               5         BS
special registers                                      6         RISC II
pipelined data path with forwarding                    6         RISC II
offline reorganization                                           BS
tag checking of addresses for load & store
hard-wired instructions                                          RISC II

Chapter  4 


Performance  Evaluation  of  the  SOAR  Architecture 


4.1.  Introduction 

Can a reduced instruction set computer make Smalltalk-80 practical? In this section
we evaluate SOAR’s overall performance, place it in context with other Smalltalk-80
systems, and examine features in the architecture to see which pull their weight and which are
just a waste of effort. Toward this end, we have analyzed running times and instruction
mixes of instruction-level simulations of Smalltalk-80 benchmarks (Figure 4.1).

Figure 4.1: Steps involved in a SOAR simulation. First, rot removes the object table from
the Xerox Smalltalk-80 image. We then use BS to make any modifications necessary in the
image (e.g. to eliminate some becomes). Newb2* produces a Smalltalk image for SOAR by
converting the BS objects to SOAR format, and running Hilfinger’s Slapdash compiler
which translates the bytecoded programs to SOAR instructions. We have also coded the
Smalltalk primitive operations and storage management software in SOAR assembly
language. After this is assembled, it is fed to Daedalus, our SOAR simulator, along with the
Smalltalk image. The initials below each system indicate its author: ads is Dain Samples,
phn is Paul Hilfinger, and dmu is David Ungar.


We have instrumented the SOAR simulator to record two types of data: frequencies
and profiles. Obtaining data from the simulator makes it possible to measure execution
without altering the program being measured. The simulator counts the number of times it
executes each instruction, the number of each type of trap taken, and other events. The
simulator also samples the program counter every hundred instructions. To gather the data,
we run a benchmark once, reset the simulator’s counters, enable profiling, run the
benchmark for a second iteration, and then dump the raw data to files. (Appendix B contains our
raw frequency data.) Unix™ utilities (awk and sed) analyze the data and report the usage
and value of particular features. (Appendix A contains these results.)
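The measurement procedure just described (count every event, sample the program counter at a fixed interval, discard the first iteration) can be sketched as follows. This is a minimal illustration and not the Daedalus simulator: the class, the method names, and the synthetic trace are all hypothetical, and a real simulator would execute instructions, not replay a list.

```python
from collections import Counter


class Instrumentation:
    """Event counters plus a program-counter sampler (hypothetical names)."""

    def __init__(self, sample_interval=100):
        self.sample_interval = sample_interval
        self.profiling = False
        self.reset()

    def reset(self):
        """Clear all counters, as is done between the two iterations."""
        self.opcode_counts = Counter()
        self.pc_samples = Counter()
        self.executed = 0

    def record(self, pc, opcode):
        """Called once per simulated instruction."""
        self.opcode_counts[opcode] += 1
        self.executed += 1
        if self.profiling and self.executed % self.sample_interval == 0:
            self.pc_samples[pc] += 1  # profile: where was the PC at this sample?


def run_benchmark(trace, instrumentation):
    """Stand-in for one benchmark iteration: replay a (pc, opcode) trace."""
    for pc, opcode in trace:
        instrumentation.record(pc, opcode)


# Hypothetical 300-instruction trace; every third instruction is a call.
trace = [(4 * i, "add" if i % 3 else "call") for i in range(300)]

inst = Instrumentation()
run_benchmark(trace, inst)   # first iteration: warm-up, data discarded
inst.reset()
inst.profiling = True
run_benchmark(trace, inst)   # second iteration: the one that is measured
```

Because the counters live outside the simulated program, the program being measured runs exactly the same instructions whether or not it is instrumented.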

Xerox has defined an official set of benchmarks for the Smalltalk-80 system [McC83].
Some are called “micro-benchmarks” because they test particular small operations like
integer addition. The rest are called “macro-benchmarks” because they test large
operations like compilation, display, and exploring system organization. These are typical
high-level activities for Smalltalk-80 programmers. We selected five macro-benchmarks for
our measurements. When writing Smalltalk-80 programs, we spend more time waiting for
the compiler than for anything else. For this reason, we started with the testCompiler
benchmark. The other four benchmarks were chosen because they did not output to the
display and did not require substantial modifications for SOAR. Although fast display
output is vital for Smalltalk, it has been addressed by many others, and is outside the scope of
this dissertation. The following descriptions of the benchmarks we chose quote from
[McC83]:

testClassOrganizer

“This benchmark measures the speed of conversion between the textual and the
structural representations of a class organization. The example chosen is class Benchmark
because its organization contains many categories.”

testPrintDefinition

“This benchmark measures how quickly a class definition, as it appears in the system
browser, can be generated. The example chosen is an instance of class Compiler
because it has a moderate number of instance variables.”

testPrintHierarchy

“This benchmark times the printing of a portion of the Smalltalk-80 class hierarchy.
The example chosen is class InstructionStream because it has several subclasses.”

testCompiler

“This benchmark measures the speed of the compiler on a slightly longer than normal
method, one containing 87 tokens and compiling into 73 bytecodes.”

testDecompiler

“This benchmark measures the speed of the Decompiler by decompiling all the
methods in class InputSensor.”

In addition, we used a few micro-benchmarks to evaluate an upper bound for the
performance impact of specific features:

testPopStoreInstVar

“This benchmark measures how quickly a value can be popped off the stack and
stored in an instance variable of the receiver. Because this value is the SmallInteger
1, there is little reference counting overhead on the push or store. 50% of the bytes in
the block are 16r60,* a pop of the top of the stack into the receiver's first instance
variable.”

test3plus4

“This benchmark measures the speed of SmallInteger addition. Because all values
are SmallIntegers, there is little reference-counting overhead. 25% of the bytes in the
block are 16rB0,* a quick send of the message +.”

testActivationReturn

“This very important benchmark uses a call on a doubly-recursive method to measure
the speed of method activation and return. There is little reference-counting overhead
associated with knowing when to end the recursion, but there may be a great deal in
managing the Contexts that represent the activations. About 12.5% of the bytes
executed during this benchmark are 16rE0,* a send of the method’s first literal (in this
case, the Symbol recur:), and about 12.5% are returns, split evenly between 16r78,* a
quick return of the receiver, and 16r7C,* a return of the value on the top of the stack.”

* The 16r prefix denotes a hexadecimal number. For example, 16r7C is 124.

How representative are these five macro-benchmarks? Xerox rates the performance of
Smalltalk-80 systems relative to the Dorado by taking the mean of the 13 macro-benchmarks
plus the text scanning and BitBlt micro-benchmarks [Bay84]. Table 4.1 below compares the
compiler benchmark, the median of the five macro-benchmarks used here, and the Xerox
performance rating for four other Smalltalk-80 systems. The data suggest that the
benchmarks we used slightly underestimate overall performance.

[Table 4.1: Comparison of Performance Metrics. Columns: compiler benchmark; median of
classOrganizer, compiler, decompiler, printDefinition, and printHierarchy; Xerox
Performance Rating. Rows: Berkeley Smalltalk on Sun 2 [Bay84]; Tektronix 4404 [Bay84];
Xerox PS on Sun 2 [Bay85]; Xerox PS on Sun 3 [Bay85]; Xerox Dorado; SOAR (simulated @
400 ns).]

We have not considered the interaction between the availability of hardware features
and the sophistication of the optimizations performed by the compiler. The only compiler
changes we have taken into account are those required to simulate the missing hardware.
For example, to compute the overhead of software type checking, we counted the number of
times that hardware type checking was performed by code from the current compiler and
multiplied that count by the cost of a software check. It is possible that a Smalltalk-80
compiler for a machine without hardware support for type checking would reduce the overhead
with a data-flow analysis to eliminate redundant type checking. However, such techniques
are not used in existing Smalltalk-80 compilers, which must cope with dynamic type
binding. The performance measurements in this dissertation hold only for Smalltalk-80 systems
with state-of-the-art compiler technology.
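The count-times-cost method described above amounts to a one-line estimate. The sketch below shows its shape only. The function name is ours and every number in it is hypothetical; the actual counts and costs are in the appendices and are not reproduced here.

```python
def slowdown_without_feature(total_cycles, uses, extra_cycles_per_use):
    """Fractional increase in run time if each hardware-assisted operation
    were replaced by a software sequence costing extra_cycles_per_use more.

    Assumes the compiler is otherwise unchanged, as in the text.
    """
    return uses * extra_cycles_per_use / total_cycles


# Hypothetical counts, chosen only to show the calculation.
total_cycles = 1_000_000       # cycles in the measured iteration
tagged_operations = 80_000     # times the hardware performed a type check
software_check_cycles = 3      # extra cycles for an equivalent software check

penalty = slowdown_without_feature(total_cycles, tagged_operations, software_check_cycles)
print(f"estimated time penalty: {penalty:.0%}")
```

An optimizing compiler that removed redundant checks would lower `uses`, which is why the text restricts its conclusions to existing compiler technology.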

4.2. Overall Performance: SOAR vs Dorado

Can SOAR provide acceptable performance with a single-chip processor? The Dorado
is the only Smalltalk-80 system that everyone agrees is fast enough. If SOAR can run as fast
as a Dorado, it will certainly provide a usable Smalltalk-80 system. (The Xerox MC68020
Smalltalk-80 system is also approaching the Dorado’s performance.) Table 4.2 compares
SOAR’s performance to the Dorado on five macro-benchmarks and the procedure call
micro-benchmark. The Dorado numbers were obtained from Xerox’s Smalltalk-80
Newsletter [Bay84]. The SOAR numbers were obtained by simulating the benchmarks for
two iterations, taking the number of cycles for the second iteration,* and multiplying by 400
ns,† our measured cycle time for the 4µ chips. These data show that a 400 ns SOAR will
perform well enough to please everyone who already uses Smalltalk-80.


* We consider the second iteration to be more representative. Had we used the numbers for the first iteration, initial
subroutine lookups would have slowed the benchmarks down by up to 10%.

† Implementation problems with the fast shuffle (Section 3.4.3) will prevent full speed operation unless the memory
cycle time can be reduced by 100 ns over the chip cycle time. Alternatively, the fast shuffle signal can be ignored, and the chip
could use a delayed branch architecture [Pen85a].

4.3. Relative Performance of SOAR

In the previous section, we showed that SOAR will run as fast as a Dorado. How does
this compare to other Smalltalk-80 systems? Table 4.3 compares the performance of the
compiler benchmark on several Smalltalk-80 systems. Both SOAR and the 68010 are
NMOS microprocessors, although the 68010 has almost twice as many transistors as SOAR:
68,000 vs. 35,700. Since Deutsch and Schiffman’s ST68K is also a compiled
implementation [DeS84], it serves as the fairest architectural comparison to SOAR. Unlike the ST68K
code translator, the current SOAR compiler generates unnecessary instructions (see Table
2.11); a better compiler would improve SOAR's performance. By creating a custom
processor, we have more than doubled performance, while halving the number of transistors.
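The "more than doubled, while halving" claim can be checked against the figures quoted in this section. The calculation below assumes that the 40% compiled-68010 entry in Table 4.3 is the ST68K system, which is our reading of the table and not something the text states directly. Variable names are ours.

```python
# Speeds relative to the Dorado, from Table 4.3.
soar_speed = 1.03
st68k_speed = 0.40   # assumption: the compiled 68010 row is ST68K

# Transistor counts quoted in the text.
soar_transistors = 35_700
mc68010_transistors = 68_000

speed_ratio = soar_speed / st68k_speed
transistor_ratio = soar_transistors / mc68010_transistors

print(f"SOAR is {speed_ratio:.2f}x as fast with {transistor_ratio:.0%} of the transistors")
```

Under that assumption the speed ratio is a little over 2.5 and the transistor ratio a little over one half.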


Table 4.3: Compiler Benchmark speed for various Smalltalk-80 systems,
relative to Dorado; larger is faster.

                              time (ns)   execution     speed
                                          interpreter
                     68010    400         interpreter   25%
                              400         compiler      40%
PS                   68020    180*        compiler      80%
Dorado    Xerox      Dorado   70          microcode     100%
SOAR      UCB        SOAR     400                       103%

4.4.  Evaluating  Individual  Features 


Although SOAR's design was driven by empirical results, our experimental subject at
that time was a bytecode interpreter, not a SOAR simulator. Now that we have a compiler,
simulator, and run-time support software for SOAR, we have been able to perform an accurate
assessment of its features (Table 4.4). (Appendix A contains detailed derivations of the data.)
Each row gives the feature’s name, the minimum, average, and maximum effect it would
have on speed were it omitted or added, and the effect it would have on total memory size.
For example, the tagged integer support is described in Section 3.2. If left out of SOAR, and
if the compiler were unchanged, the macro-benchmarks we simulated would take from 14%
to 47% longer to run, with an average time penalty of 26%. The SOAR Smalltalk-80 virtual
image would grow by 15% from its 13 MB. Remember that (except for rearranging the
pipeline) our performance figures count cycles and neglect the interaction between
architecture and cycle time. For a discussion of cycle time effects, see Pendleton’s dissertation
[Pen85b].

Table 4.4 above groups the features in the order that they were presented in the last
chapter. In Table 4.5, we have reordered them by average performance impact and added
Pendleton’s complexity results in order to identify winners and losers. The complexity index
combines the number of diagnostics, circuit blocks, and hand-drawn transistors required for
a feature. For example, the most complicated feature, multiple on-chip register windows,
has an index of 10.

The importance of register windows on SOAR stems from an important feature of the
Smalltalk-80 system, fast compilation. Like some other exploratory programming
environments, the Smalltalk-80 system achieves split-second compilation times by compiling each
procedure by itself; there are no macros, interprocedural analysis, nor static interprocedural
binding. Thus, the compiler runs fast because it has shed the burden of binding or
optimizing subroutine calls. This results in a high frequency of subroutine calls, which forces
hardware to shoulder the responsibility for efficient execution of calls. This explains why
register windows are so effective for SOAR. Although they add the most complexity of any
feature [Pen85b], SOAR would run 46% slower without them.

Table 4.5: Features in order of performance impact.
Except for rearranged pipeline, excludes impact on cycle time.

* Register windows, load- and store-multiple, and pointer-to-register all interact. For example, without register
windows, load- and store-multiple would become much more important, and pointer-to-register would be completely silly.

† Pendleton has discovered that SOAR's implementation of this feature lengthened the cycle time by ~25%. See Section

The  data  suggest  that  we  could  simplify  SOAR  without  sacrificing  much  performance. 
If  we  removed  all  but  the  winning  features,  SOAR  would  only  take  19%  more  time  and  8% 
more  memory.  Adding  Pendleton's  pipeline  rearrangement  would  then  result  in  a  simpler 
design  with  the  same  performance  as  the  original.  If  we  were  to  include  more  features,  they 
might be trap instructions, loadm/storem, and vectored traps. Such a design would be 11%
faster than SOAR, and use only 4% more memory.

Four of the features in SOAR are mistakes: parallel nilling, pointer-to-register,
generation tag hardware, and shadow registers.* Although fully aware of it, we still fell into what
we now call the “architect's trap” at least four times:

• Each mistake was a clever idea;

• Each made a particular operation much faster;

•  Each  increased  design  and  simulation  time; 

•  Not  one  significantly  improved  overall  performance. 

Another  way  to  appreciate  the  worthlessness  of  these  four  features  is  that  load/store  byte 
instructions  would  save  more  cycles  than  these  four  put  together. 

We have put these results to use by calculating the performance of some variations on
SOAR and comparing them to some real systems (Table 4.6). Our predictions of SOAR’s
performance are based on simulated macro-benchmark times and do not include virtual
memory, operating system, and I/O overhead. However, all of the Smalltalk-80 systems we
know about tend to be compute-bound for program development. For a fair comparison, we
assume a 400 ns cycle time for SOAR, RISC II, and MC68010.

* Lnedc md til neither help nor hinder. Calling them mistakes is too pejorative; we would rather think of them as idle


By comparing the speeds of different systems, we can gain some insight into the
reasons for SOAR’s good performance:

• The speed ratio of full SOAR to RISC II, 1.6, is the same as the ratio of RISC II to the
Xerox 68010 system. This indicates that the reduced instruction set architecture
(including register windows) and the Smalltalk-specific hardware features contribute
equally to performance.

• Interestingly, the Deutsch-Schiffman 68010 compiled system is a bit better than the
estimate for SOAR with only the software ideas. Perhaps the optimizations in
Deutsch's compiler account for the difference.

• Since the Tektronix system neither compiles nor scavenges, its software resembles a
stripped SOAR. Thus, the similar performance of the Tek system to stripped SOAR
suggests that the stripped SOAR hardware performs as well as the MC68010.
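The first observation in the list above can be made concrete. Speed ratios multiply, so two equal steps of 1.6 compound to the overall SOAR-to-68010 ratio, and on a logarithmic scale each step accounts for half of it. The sketch below uses only the 1.6 figure from the text; the variable names are ours.

```python
import math

soar_over_risc2 = 1.6     # full SOAR relative to RISC II
risc2_over_68010 = 1.6    # RISC II relative to the Xerox 68010 system

# Ratios compound multiplicatively.
soar_over_68010 = soar_over_risc2 * risc2_over_68010

# Share of the total speedup (in log terms) due to the Smalltalk-specific step.
smalltalk_share = math.log(soar_over_risc2) / math.log(soar_over_68010)

print(f"overall ratio: {soar_over_68010:.2f}, Smalltalk-specific share: {smalltalk_share:.0%}")
```

The compounded ratio of about 2.56 is close to the ratio of the 103% and 40% entries in Table 4.3, assuming those rows describe the same two systems.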

The simplicity and high performance of eliminating all but the winning features and
rearranging SOAR’s pipeline make this an appealing design.

4.5. Conclusions


SOAR’s hardware and software design represents an advance for object-oriented
experimental programming environments. SOAR has almost half of the transistors of the
68010, yet runs Smalltalk-80 2.5 times faster. Register windows, tagged integer
instructions, direct pointers, and generation scavenging account for most of the difference. These
four ideas represent SOAR’s most important contribution to EPE systems.

Our analysis of a feature's value was based on counting cycles. Barring the pipeline
rearrangement, we ignored the effect of adding a feature on the cycle time (see [Pen85b]).
In fact, some of the features we added to the machine must have perversely increased the
cycle time enough to offset the reduction in cycles, thereby slowing down the system. In
particular, the hardware support for automatic storage reclamation probably did not speed up
SOAR. Other examples of mistakes in SOAR are the inclusion of parallel register nilling,
logic to support pointers to registers, and shadow registers to aid trap handling. We observe
that the inclusion of interesting features that complicate the design but do not improve the
performance of representative programs is a trap that many architects fall prey to, including
us.*

There are four places to look for further performance gains: compiler technology
(outside the scope of this dissertation), implementation technology (see [Pen85b]), optimization
of the run-time support primitives (which consume about two thirds of SOAR’s time), and
better hardware or software algorithms to cache call target lookups (which consume 23% of
SOAR’s time). Of these, implementation technology (circuit design and VLSI processing
technology) has the most dramatic impact. Since we started this project, the standard
VLSI technology available to universities has improved from 4µ line widths to 3µ. This one
change should reduce our cycle time from 400 ns to 290 ns, as important a contribution as
register windows. Another example is Pendleton’s pipeline rearrangement, which could
improve performance by 21%. This is more than the combined effect of parallel nilling, trap
instructions, loadm/storem, pointer-to-register, vectored traps, and generation tag-checking
hardware.

* Pendleton has discovered that SOAR's implementation of the Fast Shuffle means a 25% penalty when the chip is used
with a 400 ns memory system (Section 3.4.3). This dwarfs the architectural benefit of an 11% reduction in the number of cycles.
In this case the culprit was our failure to simulate the memory system along with the chip.

A 70 ns ECL Dorado is the only existing machine that runs Smalltalk-80 fast enough
to satisfy everyone, and the 400 ns NMOS SOAR chips that have been fabricated should run
just as fast. Thus, SOAR will support the Smalltalk-80 system with excellent performance.


Chapter  5 


Non-Disruptive  High  Performance  Storage  Reclamation 


Throw back the little ones
and pan fry the big ones;
use tact, poise and reason
and gently squeeze them.

Steely  Dan, 

“Throw  Back  the  Little  Ones” 
[BeF74] 


5.1.  Introduction 


Early in the SOAR project, we realized that automatic storage reclamation could easily
become a bottleneck. We knew the overhead for allocation and freeing in Smalltalk-80
systems ranged from 10% to 15% [DeS84, UnP83], that some reclamation algorithms
introduced annoying pauses, that some required the programmer to explicitly free circular
structures of objects, and that most of the algorithms required microcode support. Since we
needed to attain good performance in a system without microcode, we have designed,
implemented, and measured Generation Scavenging, a new garbage collector that

• limits pause times to a fraction of a second,

• requires no hardware support,

• meshes well with virtual memory,

• reclaims circular structures, and

• uses only 3% of the CPU time in SOAR. This is less than a third of the time of
deferred reference counting, the next best algorithm.*

* Experience with SOAR has made us realize that most of the other algorithms that are usually microcoded need not be.
Although our original reason for searching for a new algorithm proved to be unfounded, we found something that enjoys solid
advantages in performance and the ability to reclaim circular structures.


This section describes the challenge of providing automatic storage reclamation,
surveys some popular algorithms, and presents our solution. It concludes by evaluating the
performance of Generation Scavenging, based on running the Smalltalk-80 benchmarks
[McC83] on BS and simulating them on SOAR. An earlier and shorter version of this
chapter appeared in [Ung84].

5.2. The Relationship Between Virtual Memory and Storage Reclamation

The storage manager must ensure an ample supply of virtual addresses for new objects,
and must maintain a working set of existing objects in physical memory. Traditionally, the
functions have been separated into two parts as shown in Table 5.1 and Figure 5.1.

Sometimes the distinction between virtual memory and automatic reclamation can lead
to inefficiency or redundant functionality. For example, some garbage collection (GC)
algorithms require that an object be in main memory when it is freed; this may cause extra
backing store operations. As another example, both compaction and virtual memory make room
for new objects by moving old ones. Thus storage reclamation algorithms and virtual
memory strategies must be designed to accommodate each other’s needs.


Table 5.1: Traditional decomposition of storage management.

Figure 5.1: Virtual memory vs. automatic storage reclamation.


5.3. Personal Computers Must Be Responsive


Personal computers differ from time-sharing systems. For example, with personal
computers there are no other users to blame for distracting pauses. Yet personal machines
have time available for periodic offline tasks, for even the most fanatic hackers sleep
occasionally. Personal computers promise consistently short response times, which are known to
boost productivity significantly [Tha81].


5.4.  Virtual  Memory  for  Advanced  Personal  Computers 

Computers with fast, random access secondary storage can exploit program locality to
manage main memory for the programmer. Advanced personal computer systems manage
memory in many small chunks, or objects. The Symbolics ZLISP, Cedar-Mesa,
Smalltalk-80, and Interlisp-D systems are examples. Table 5.2 summarizes segmentation and paging,
the two virtual memory techniques.


5.4.1.  Segmentation 

A segmented virtual memory enjoys the flexibility of placing each object in physical
memory independently of the other objects. This packing efficiency can result in better use
of main memory and a reduction in time-consuming backing store operations. However,
segmentation's performance advantage disappears when main memory becomes more
plentiful [Sta82, So84]. Moreover, the variety and quantity of objects in advanced personal
computer systems pose tough challenges for a segmented virtual memory. In our
Smalltalk-80 memory image, for example, the length of an object can vary from 24 bytes
(points), to 128,000 bytes (bitmaps), with a mean of about 50. Suppose segmentation alone
is used. When an object is created or swapped in, a piece of main memory as large as the
object must be found to hold it. Thus, a few large bitmaps can crowd out many smaller but
more frequently referenced objects.

[Table 5.2: Segmentation vs. Paging. Rows: time overhead, first implemented, current
example.]

When objects are small, it takes many of them to accomplish anything. Smalltalk-80
systems already contain 32,000 to 64,000 objects, and this number is increasing. A
segmented memory with this many segments requires either a prohibitively large or a
content-addressable segment table.† This large number hampers address translation.


5.4.2. Demand Paging

The simplicity of page table hardware and the opportunity to hide the address
translation time make paging attractive to hardware designers [Den70]. Paging, however, is
not a panacea for advanced personal computers. It can squander main memory by dispersing
frequently referenced small objects over many pages. Blau has shown that periodic offline
reorganization can prevent this disaster [Bla83d]. The daily idle time of a personal computer
can be used to repack objects onto pages.

Many objects in advanced personal computers live only a short time. The paging
literature contains little about strategies for such objects. Since their lifetimes are shorter
than the time to access backing store, these objects should never be paged out. By
segregating short-lived objects from permanent ones, Generation Scavenging permits them to
be locked in main memory. Table 5.3 summarizes the obstacles that advanced personal
computers pose for a paged virtual memory, and the solutions that SOAR has adopted. BS and
the DEC VAX/Smalltalk-80 system [BaS83] use paging.


5.5. Automatic Storage Reclamation for Advanced Personal Computers

Advanced personal computers depend on efficient automatic storage reclamation. For
example, Berkeley Smalltalk allocates a new object every 80 instructions. This is consistent
with Foderaro’s results for a few voracious Lisp programs [FoF81]. Since the total size of
the system was in equilibrium for these measurements, the reclamation rate must match
the allocation rate. The mean dynamic object size is 70 bytes. Thus, seven bits must
be reclaimed for every instruction executed.
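
The seven-bit figure follows directly from the two measurements quoted above. As a minimal sketch of that arithmetic (the function name and parameters are illustrative and do not come from the original text):

```c
#include <assert.h>

/*
 * Bits that must be reclaimed per executed instruction when the system
 * is in equilibrium, i.e. when reclamation keeps pace with allocation.
 * One object of mean_object_bytes is allocated every
 * instructions_per_object instructions.
 */
double bits_reclaimed_per_instruction(double mean_object_bytes,
                                      double instructions_per_object)
{
    assert(instructions_per_object > 0.0);
    return 8.0 * mean_object_bytes / instructions_per_object;
}
```

With the Berkeley Smalltalk figures, bits_reclaimed_per_instruction(70.0, 80.0) evaluates to 7.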

Let’s examine several garbage collection algorithms and evaluate their suitability for
advanced personal computers. Where possible, we use performance figures from actual
implementations of these algorithms. The Xerox Dorado Smalltalk-80 system is closest to
an advanced personal computer; when we try to compare results we shall normalize to that
speed. For example, the bandwidth imposed on the BS storage allocator is

\[
\frac{70\ \text{bytes}}{1\ \text{object}} \times
\frac{1\ \text{object}}{80\ \text{instructions}} \times
\frac{9000\ \text{bytecodes}}{\text{second}}
\approx 7900\ \frac{\text{bytes}}{\text{second}}
\]

If we scale this up to the speed of the Xerox Dorado system, the storage allocation rate
exceeds 100 KB/s.

Table 5.3: Paging.

problem                      description                    SOAR solution

internal fragmentation       1 object per page              offline reorganization
address size                 need 64K 50-byte objects       big addresses (2^28 words)
paging short-lived objects   page faults for dead objects   segregation by age,
                                                            don't page new ones


Jon L. White was one of the first researchers to exploit the overlap between the
functions of virtual memory and garbage collection, and he proposed that address space
reclamation was obsolete in a virtual memory system [Whi80]. He pointed out that as long as
referenced objects were compacted into main memory, dead objects would be paged out to
backing store. This strategy may have adequate performance as far as CPU time and main
memory utilization are concerned, but it demands too much from the backing store in a
Smalltalk-80 system. Even if a 100 MB backing store could keep up with the 100 KB/sec
allocation bandwidth, it would fill up in less than an hour:

\[
\frac{100\ \text{MB} / \text{disk}}{100\ \text{KB trash} / \text{second}} \approx 20\ \text{minutes}.
\]

This is unacceptable.
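
The fill-time estimate can be checked with a one-line calculation. The sketch below is illustrative only: the function name is hypothetical, and it assumes decimal units (1 MB = 10^6 bytes, 1 KB = 10^3 bytes), which the text does not specify.

```c
#include <assert.h>

/*
 * Minutes until a backing store of store_bytes fills up when dead
 * objects are paged out (never reclaimed) at trash_bytes_per_second.
 */
double minutes_to_fill(double store_bytes, double trash_bytes_per_second)
{
    assert(trash_bytes_per_second > 0.0);
    return store_bytes / trash_bytes_per_second / 60.0;
}
```

Under these assumptions minutes_to_fill(100e6, 100e3) is 1000 seconds, or roughly 17 minutes, which the text rounds to 20 minutes. Either way the disk fills in well under an hour.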

There are many automatic storage reclamation algorithms [Coh81], but they can be
divided into two families: those that maintain reference counts and those that traverse and
mark live objects. In the next few sections, we examine several reclamation algorithms and
discuss their suitability for advanced personal computers.

5.6.  Reclaiming  Storage  by  Counting  References 

Reference counting was invented in 1960 [Col60] and has undergone many
refinements [Knu73, Sta80]. The central idea is to maintain a count of the pointers that
reference each object. If an object’s reference count should fall to zero, the object is no
longer accessible and its space can be reclaimed (Figure 5.2).

5.6.1.  Immediate  Reference  Counting 

Immediate  reference  counting  adjusts  reference  counts  on  every  store  instruction  and 
reclaims  an  object  as  soon  as  its  count  drops  to  zero.  Both  the  Dorado  Smalltalk-80  system 
[GoR83]  and  LOOM  [KaK83,  Sta82,  Sta84]  reclaim  space  with  this  algorithm.  Compaction 
is  handled  separately  and  typically  causes  a  pause  of  1.3  seconds  every  1  to  20  minutes  on  a 
Sun  68010  workstation. 

Counting references takes time. For each store, the old contents of the cell must be
read so that its referent’s count can be decremented, and the new content's referent’s count
must be increased. This consumes 13% of the CPU time [Deu83b, UnP83]. When an
object’s count diminishes to zero, the object must be scanned to decrement the counts of
everything it references. This recursive freeing consumes an additional 5% of execution
time [Deu82a, UnP83]. Thus, the total overhead for reference counting is about 20%. This
substantial overhead is acceptable for personal computers, but deferred reference counting
and Generation Scavenging (discussed below) use much less.


Figure 5.2: Standard reference counting. The standard reference counting algorithm
associates a reference count with each object. An object is reclaimed when the count goes to
zero. Object 3 is referenced only by itself, and is thus garbage. Since its count is nonzero,
it cannot be reclaimed by a reference counting algorithm.

Reference counting cannot reclaim cycles of unreachable objects. Even though the
whole cycle is unreachable, each object in it has a non-zero count. Deutsch [Deu83b]
believes that this limitation has hurt programming style on the Xerox Smalltalk-80 system
(which employs reference counts), and Lieberman [LiH83] has also stated that circular
structures are becoming increasingly important for artificial intelligence applications. The
advantage of immediate reference counting is that it uses the least amount of memory for
temporary objects: about 15 KB when running the Smalltalk-80 macro benchmarks.
However, its inability to reclaim circular structures remains a serious drawback for advanced
personal computers.
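
The store sequence and the cycle problem described in this section can be illustrated with a small sketch. This is not the Dorado or LOOM implementation. The structure layout, the names (rc_object, rc_store, rc_release, live_objects), and the fixed two-field objects are all assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_FIELDS 2

struct rc_object {
    int count;                            /* number of pointers to this object */
    struct rc_object *field[MAX_FIELDS];
};

int live_objects = 0;                     /* allocated and not yet freed */

struct rc_object *rc_new(void)
{
    struct rc_object *o = calloc(1, sizeof *o);
    assert(o != NULL);
    live_objects++;
    return o;                             /* count is 0 until it is stored */
}

/* Drop one reference; on reaching zero, recursively free the referents. */
static void rc_release(struct rc_object *o)
{
    int i;

    if (o == NULL || --o->count > 0)
        return;
    for (i = 0; i < MAX_FIELDS; i++)      /* recursive freeing */
        rc_release(o->field[i]);
    free(o);
    live_objects--;
}

/* A store under immediate reference counting. */
void rc_store(struct rc_object **cell, struct rc_object *new_value)
{
    struct rc_object *old_value = *cell;  /* old contents must be read */

    if (new_value != NULL)
        new_value->count++;               /* increment first, in case old == new */
    *cell = new_value;
    rc_release(old_value);                /* may reclaim immediately */
}
```

An acyclic structure is reclaimed as soon as the last pointer to it is overwritten. An object that points to itself, like Object 3 in Figure 5.2, keeps a count of one after the last outside pointer disappears, so rc_release never frees it.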


5.6.2. Deferred Reference Counting

The Deutsch-Bobrow deferred reference counting algorithm reduces the cost of
maintaining reference counts [DeB76]. Three contemporary personal computer programming
environments use this algorithm: Cedar-Mesa, Interlisp-D (both on Dorados), and an
experimental Smalltalk-80 system which furnished the performance measurements quoted herein
[DeS84]. The Deutsch-Bobrow algorithm diminishes the time spent adjusting reference
counts by ignoring references from local variables (Figure 5.3). These uncounted references
preclude reclamation during program execution. To free dead objects, the system
periodically stops and reconciles the counts with the uncounted references. On a typical personal
computer the algorithm requires 25 kB more space than immediate reference counting, and
averages 30 ms pauses every 500 ms.
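
The following sketch illustrates the idea of deferring reclamation until reconciliation. It is a simplified illustration, not the Deutsch-Bobrow implementation: the table of zero-count candidates, all names, and the fixed table size are assumptions, and recursive freeing and the handling of newly created objects are omitted.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CANDIDATES 16

struct drc_object {
    int count;   /* references from heap cells only; stack is uncounted */
    int freed;   /* set once the object has been reclaimed */
};

/* Objects whose heap count has dropped to zero (hypothetical table). */
struct drc_object *zero_count_table[MAX_CANDIDATES];
int zero_count_size = 0;

/* Store into a heap cell: counts are adjusted, but nothing is freed. */
void drc_heap_store(struct drc_object **cell, struct drc_object *new_value)
{
    struct drc_object *old_value = *cell;

    if (new_value != NULL)
        new_value->count++;
    *cell = new_value;
    if (old_value != NULL && --old_value->count == 0) {
        assert(zero_count_size < MAX_CANDIDATES);
        zero_count_table[zero_count_size++] = old_value;
    }
}

/* Store into a local variable (stack slot): no count manipulation. */
void drc_stack_store(struct drc_object **slot, struct drc_object *value)
{
    *slot = value;
}

/* Periodic reconciliation: free candidates that the stack does not reference.
 * Returns the number of objects reclaimed. */
int drc_reconcile(struct drc_object **stack, int stack_depth)
{
    int i, j, kept = 0, reclaimed = 0;

    for (i = 0; i < zero_count_size; i++) {
        struct drc_object *o = zero_count_table[i];
        int on_stack = 0;

        if (o->freed || o->count != 0)
            continue;                     /* duplicate entry, or referenced again */
        for (j = 0; j < stack_depth; j++)
            if (stack[j] == o)
                on_stack = 1;
        if (on_stack)
            zero_count_table[kept++] = o; /* still live; re-examine next time */
        else {
            o->freed = 1;
            reclaimed++;
        }
    }
    zero_count_size = kept;
    return reclaimed;
}
```

An object whose count is zero survives reconciliation for as long as a stack slot still refers to it, and is reclaimed at the first reconciliation after that reference disappears.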


Baden's measurements of a Smalltalk-80 system suggest that this method saves 90%
of the reference count manipulation needed for immediate reference counting [Bad82].
Deferred reference counting spends about 3% of the total CPU time manipulating reference
counts, 3% for periodic reconciliation, and 5% for recursive freeing. Thus, deferred
reference counting uses about half the time of simple reference counting.


Figure 5.3: Deferred reference counting. The deferred reference counting algorithm does
not count references to objects from the execution stack. A zero count does not ensure that
an object is reclaimable; it may still have references from the stack.


What would be the space cost for deferred reference counting on SOAR? The most
efficient representation of a reference count on SOAR would be one word per count. Table
5.4 shows the code sequence for reference counting on SOAR. Since this sequence is nine
words long, we can multiply the number of tagged stores by nine to compute the code
overhead for reference counting on SOAR (Table 5.5). This calculation shows that a
straightforward implementation of deferred reference counting would increase the image size by 16%.*

Although more efficient than immediate reference counting, deferred reference
counting still does not reclaim circular structures. This is its biggest drawback.


5.7.  Reclaiming  Storage  by  Finding  Reachable  Objects 

Marking reclamation algorithms collect garbage by first traversing and marking
reachable objects and then reclaiming the space filled by unmarked objects. Unlike reference
counting, these algorithms reclaim circular structures.


Table 5.4: Reference counting sequence on SOAR.

    %load   (storeObj)offset, oldContents
    load    (oldContents)countOffset, oldRC    /* tag trap handles int case */
    %skip   eq oldRC, 1
    %call   freeRoutine
    %sub    oldRC 1, oldRC
    %store  oldRC (oldContents)countOffset
    load    (newContents)countOffset, newRC
    %add    newRC 1, newRC
    %store  newRC (newContents)countOffset


Table 5.5: Static cost for reference counting on SOAR.

number of tagged store instructions    3578
mean object length                     14 words
total size of image                    1,500 kB
relative space cost of code            8.59%
relative space cost of counts          7.14%
total space cost                       15.73%
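
The entries of Table 5.5 can be reproduced from the nine-word sequence of Table 5.4. The sketch below is a hedged reconstruction of that arithmetic: the function names are hypothetical, and it assumes 4-byte words and 1 kB = 1000 bytes, neither of which is stated in the table.

```c
#include <assert.h>

/* Code added by the counting sequences, as a percentage of the image. */
double code_cost_percent(long tagged_stores, int words_per_sequence,
                         int bytes_per_word, double image_bytes)
{
    assert(image_bytes > 0.0);
    return 100.0 * tagged_stores * words_per_sequence * bytes_per_word
           / image_bytes;
}

/* One extra count word per object, as a percentage of object storage. */
double count_cost_percent(double mean_object_words)
{
    assert(mean_object_words > 0.0);
    return 100.0 / mean_object_words;
}
```

Under these assumptions, code_cost_percent(3578, 9, 4, 1500e3) is about 8.59 and count_cost_percent(14.0) is about 7.14, and their sum is the 15.73% total shown in the table.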

* The time required to manipulate reference counts on SOAR is the time to adjust a count, perhaps 25 cycles, times the
frequency of tagged store instructions, or 0.340 (Table A.47), divided by the average cycles per instruction, or 1.5. This gives
an estimate of 6%. If reconciliation adds another 2%, we obtain a total of 8%, which is consistent with Deutsch's measurements.


5.7.1.  Mark  and  Sweep 

The first marking storage reclamation algorithm, mark and sweep, was introduced in
1960 [McC60]. It has many variations [Coh81, Knu73, Sta80], and is used in contemporary
systems [FoF81]. After marking reachable objects, the mark and sweep algorithms reclaim
one object at a time, by sweeping the entire address space. Fateman has found that some
Franz Lisp programs spend 25% to 40% of their time marking and sweeping [Fat83] and
require about 1.9 MB for dynamic objects (compared to about 1 MB for static objects).
These algorithms are inefficient because they access a large number of objects: the marking
phase inspects all live objects, and the sweeping phase modifies all dead ones.
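
A minimal sketch of the two phases follows. It is not the Franz Lisp collector. The fixed-size heap array, the two-field objects, and all names are assumptions chosen to keep the example short, and the recursive marker is used only for brevity.

```c
#include <assert.h>
#include <stddef.h>

#define HEAP_OBJECTS 8
#define MAX_FIELDS   2

struct ms_object {
    int in_use;                            /* allocated and not yet reclaimed */
    int marked;                            /* set during the mark phase */
    struct ms_object *field[MAX_FIELDS];
};

struct ms_object heap[HEAP_OBJECTS];       /* the whole address space */

/* Mark phase: visit every object reachable from o. */
static void mark(struct ms_object *o)
{
    int i;

    if (o == NULL || o->marked)
        return;
    o->marked = 1;
    for (i = 0; i < MAX_FIELDS; i++)
        mark(o->field[i]);
}

/* Mark from the roots, then sweep the entire heap.
 * Returns the number of objects reclaimed. */
int mark_and_sweep(struct ms_object **roots, int nroots)
{
    int i, j, reclaimed = 0;

    assert(nroots >= 0);
    for (i = 0; i < nroots; i++)
        mark(roots[i]);
    for (i = 0; i < HEAP_OBJECTS; i++) {   /* touches live and dead alike */
        if (heap[i].in_use && !heap[i].marked) {
            heap[i].in_use = 0;
            for (j = 0; j < MAX_FIELDS; j++)
                heap[i].field[j] = NULL;
            reclaimed++;
        }
        heap[i].marked = 0;                /* reset for the next collection */
    }
    return reclaimed;
}
```

Because liveness is decided by reachability rather than by counts, two unreachable objects that point at each other are reclaimed. The sweep loop also shows the cost noted above: it walks every slot of the heap regardless of how few objects are live.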

The marking phase inspects every live object and thereby causes backing store
operations.* Foderaro found that for some LISP programs, hints to the virtual memory system
could reduce the number of page faults for a mark and sweep from 120 to 90 [FoF81]. Even
with hints, marking and sweeping with paging causes on average a 4.5 second pause every
79 seconds. This is unacceptable for an interactive personal computer.

5.7.2. Scavenging Live Objects

The  costly  phase  of  sweeping  dead  objects  can  be  eliminated  by  moving  the  live 
objects  to  a  new  area,  a  technique  called  scavenging.  A  scavenge  is  a  breadth-first  traversal 
of  reachable  objects.  After  a  scavenge,  the  former  area  is  free,  so  that  new  objects  can  be 
allocated  from  its  base.  In  addition  to  the  performance  savings,  a  scavenging  reclaimer  also 
compacts,  obviating  a  separate  compaction  pass.  Scavenging  algorithms  must  also  update 
pointers  to  the  relocated  objects. 
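
The breadth-first copy with forwarding pointers can be sketched as follows. This is an illustrative two-space copier in the style described above, not code from any of the cited systems. The fixed-size spaces, the two-field objects, and the names are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define SPACE_OBJECTS 8
#define MAX_FIELDS    2

struct sv_object {
    struct sv_object *forward;             /* non-NULL once the object is copied */
    struct sv_object *field[MAX_FIELDS];
};

struct sv_object from_space[SPACE_OBJECTS];
struct sv_object to_space[SPACE_OBJECTS];
int to_space_used = 0;

/* Copy an object to the new area unless it was already copied;
 * return its new address. */
static struct sv_object *forward_object(struct sv_object *o)
{
    if (o == NULL)
        return NULL;
    if (o->forward == NULL) {
        assert(to_space_used < SPACE_OBJECTS);
        to_space[to_space_used] = *o;      /* copy; fields still point to old area */
        o->forward = &to_space[to_space_used++];
    }
    return o->forward;
}

/* Scavenge everything reachable from the roots.
 * Returns the number of surviving objects. */
int scavenge(struct sv_object **roots, int nroots)
{
    int scan, i;

    to_space_used = 0;
    for (i = 0; i < nroots; i++)
        roots[i] = forward_object(roots[i]);
    /* The copied objects form the queue of a breadth-first traversal. */
    for (scan = 0; scan < to_space_used; scan++)
        for (i = 0; i < MAX_FIELDS; i++)
            to_space[scan].field[i] =
                forward_object(to_space[scan].field[i]);
    return to_space_used;
}
```

Dead objects are never touched, and the survivors end up packed at the base of the new area, so no separate sweep or compaction pass is needed. The forwarding pointer also lets a live cycle be copied exactly once.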

Automatic storage reclamation algorithms that scavenge include Baker’s semispace
algorithm [Bak77], Ballard's algorithm [BaS83], Generation Garbage Collection [LiH83],
and Generation Scavenging. Baker’s algorithm divides memory into two spaces and
scavenges all reachable objects from one space to the other (Figure 5.4). Ballard
implemented this algorithm for his VAX/Smalltalk-80 system and observed that many objects
were long-lived. The addition of a separate area for these objects resulted in a substantial
performance improvement by eliminating the periodic copy of them. Ballard’s system has
600 KB for static objects, a 512 KB object table, and two 1 MB semispaces for dynamic
objects. It spends only 7% of its time reclaiming storage, including sweeping the object
table to reclaim entries. Since it is embedded in an interpretive system that runs
Smalltalk-80 programs a twelfth as fast as the Dorado (Table 2.2), the CPU overhead for this
algorithm may rise above 7% on a high-performance system.

* The sweep phase also requires backing store operations, but its sequential nature accommodates prefetching.

Generation Garbage Collection [LiH83] exploits the observation that many young
objects die quickly and generalizes Baker’s algorithm by segregating objects into
generations, each within its own space (Figure 5.5). Each generation may be scavenged without
disturbing older ones, permitting younger generations to be scavenged more often. This
reduces the time spent scavenging older, more stable objects. At present, there are no
published performance data on this algorithm.

The scavenging algorithms above incur hidden costs because they interleave
scavenging with program execution. The key idea is to avoid pauses due to scavenging by
subdividing the work and scavenging a few objects every time a new one is allocated. The problem
with mixing execution with reclamation is that the program may try to use a pointer to an
object that has been scavenged to another area. This problem can be solved by checking all
loads and following the forwarding pointers, but the solution in turn imposes additional
overhead on the running program. Thus, eliminating pauses slows execution.


Figure 5.4: Baker semispaces. The Baker storage reclamation algorithm divides memory
into semispaces. When one fills up, the live objects in it are copied to the other semispace.

Figure 5.5: Generation garbage collection. Generation garbage collection is a
generalization of Baker semispaces. This algorithm divides memory into many small semispaces, one
per "generation.” When a semispace fills up, its contents are scavenged to the next one.

Algorithms  that  segregate  objects  into  generations  must  maintain  tables  of  references 
from  older  to  younger  objects.  These  algorithms  save  time  by  reclaiming  space  in  younger 
generations  without  traversing  older  generations.  The  burden  of  maintaining  these  tables 
falls  on  some  store  instructions. 


5.8. The Generation Scavenging Automatic Storage Reclamation Algorithm

Generation Scavenging arose from our attempts to find an efficient, unobtrusive storage
reclamation algorithm for SOAR that did not require microcode. Our test vehicle was
Berkeley Smalltalk, which originally used reference counting. Measurements of BS object
lifetimes proved that young objects die young and old objects continue to live. We then
designed Generation Scavenging to exploit that behavior and substituted it for reference
counting in Berkeley Smalltalk. The result was an eight-fold reduction in the percentage of
time spent reclaiming storage, from 13% to 1.5%. In addition, the intrinsic compaction
provided by scavenging made it possible to eliminate the Object Table and its accompanying
indirection. After eliminating the object table and reference counting, BS ran 1.7 times
faster than before. In addition to the performance improvement, since Generation Scavenging
was not based on reference counting, it was able to reclaim cycles of unreachable data
structures.


5.8.1. Overview of Generation Scavenging Algorithm

Each object is classified as either new or old. Old objects reside in a region of memory
called the old area. All old objects that reference new ones are members of the remembered
set. Objects are added to this set as a side effect of store instructions. (This checking is not
required for stores into local variables because stack frames are always new.) Objects that no
longer refer to new objects are deleted from the remembered set during scavenging. All new
objects that are referenced must be reachable directly from the old objects in the
remembered set, or through a chain of new objects ultimately linked to the remembered set. Thus,
a traversal in new space, starting at the remembered set (and virtual machine registers), can
find all live new objects. Table 5.6 summarizes the characteristics of the two generations for
Generation Scavenging.
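
The store check that maintains the remembered set might look like the sketch below. This is an assumption-laden illustration rather than the BS or SOAR mechanism: the is_new and is_remembered fields, the checked_store name, and the fixed-size table are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_REMEMBERED 64
#define MAX_FIELDS     4

struct gs_object {
    int is_new;                            /* 1 if the object lives in a new area */
    int is_remembered;                     /* 1 if already in the remembered set */
    struct gs_object *contents[MAX_FIELDS];
};

struct gs_object *remembered_set[MAX_REMEMBERED];
int remembered_set_size = 0;

/* Store value into target->contents[index]; if this makes an old object
 * refer to a new one, add the old object to the remembered set. */
void checked_store(struct gs_object *target, int index,
                   struct gs_object *value)
{
    assert(index >= 0 && index < MAX_FIELDS);
    target->contents[index] = value;
    if (!target->is_new && value != NULL && value->is_new
        && !target->is_remembered) {
        assert(remembered_set_size < MAX_REMEMBERED);
        target->is_remembered = 1;
        remembered_set[remembered_set_size++] = target;
    }
}
```

Only an old-to-new store adds an entry, and the flag keeps an object from being entered twice. Removal of objects that no longer refer to new ones is left to the scavenger, as the text describes.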

There  are  three  areas  for  new  objects  (Figure  5.6): 

•  NewSpace, a large area where new objects are created,

•  PastSurvivorSpace, which holds new objects that have survived previous scavenges, and

•  FutureSurvivorSpace, which is used only during scavenging.

A scavenge moves live new objects from NewSpace and PastSurvivorSpace to
FutureSurvivorSpace, then interchanges Past and FutureSurvivorSpace. At this point, no live objects
are left in NewSpace, and it can be reused to create more objects. The scavenge incurs a
space cost of only one bit per object. Its time cost is proportional to the number of live new
objects and thus is small, since only 1 in 20 objects survive a scavenge. If a new object
survives enough scavenges, it moves to the old object area and is no longer subject to online
automatic reclamation. This promotion to old status is called tenuring. Figure 5.7 depicts
both the old and new areas for Generation Scavenging.


Table 5.6: Generations in Generation Scavenging for BS.

contents             volatile objects    permanent objects
residence            new space           old space
space size           200 KB*             940 KB
location             main memory         demand paged
created by           instantiation       tenuring
reclaimed by         scavenging          mark-and-sweep
reclaimed every      16 sec              3 - 8 hrs
reclamation takes    0.16 sec            5 min


5.8.2. Detailed Description of Generation Scavenging


Recall that the purpose of a scavenge is to transport the surviving new objects from
NewSpace and PastSurvivorSpace to FutureSurvivorSpace. A one-pass breadth-first
algorithm copies the objects and updates pointers to them as it goes along. It starts by searching
all the old objects in the remembered set for pointers to new objects, which it copies to
FutureSurvivorSpace. Then, it updates the pointer to point to the copy instead of the
original, leaves another pointer to the copy in the first word of the original, and sets a flag bit to
indicate that the original has been moved. If the scavenging algorithm encounters a
reference to the same object again, the flag bit and forwarding pointer will enable it to detect that
the object has already been scavenged and to update the reference. After this first pass, all
new objects referenced by old objects have been scavenged. Now, the algorithm starts
traversing FutureSurvivorSpace and scavenging any new objects referenced from there. As
more objects are copied, the end of FutureSurvivorSpace grows away from the scan, until
finally, all live new objects have been scavenged and the scan catches up to the end. At this
point, the algorithm terminates.


Figure 5.6: Generation Scavenging's three areas for new objects. The largest area holds
newly-created objects (NewSpace). Two smaller areas alternately hold objects that have
survived previous scavenges (PastSurvivorSpace) and receive objects copied by the current
scavenge (FutureSurvivorSpace). This unbalanced division saves memory over a
semispace algorithm.

Figure 5.7: Bird's-eye view of Generation Scavenging. After an object has survived
enough scavenges, it is promoted to the old object area. New objects are locked down in
physical memory; old objects reside in virtual memory and may be paged out.

In addition to preserving live objects, the algorithm must promote those objects that
survive for a long time into OldSpace. If it did not, much time would be wasted copying and
recopying the same objects back and forth. So, each object includes a count of the number of
scavenges it has survived. If this count should reach a certain threshold, the object gets
scavenged to OldSpace instead of FutureSurvivorSpace. At this point, the object must be
added on to the end of the remembered set in case it contains any pointers to other new
objects. After completing a pass, the algorithm checks the remembered set. If it has grown,
the new part is scanned, which may add objects to the end of FutureSurvivorSpace. Then, if
FutureSurvivorSpace has grown, the new portion of that area must be scanned, which may
add objects to the end of the remembered set. The final form of the algorithm, therefore,
resembles two coroutines: one which searches the remembered set, and another which
searches FutureSurvivorSpace for pointers to new objects. This is easily implemented in C
with two subroutines called alternately in a loop. The loop terminates when one of the
subroutines completes without adding more objects for the other one to scan.


We  now  present  the  Generation  Scavenging  algorithm  top-down,  in  pidgin  C: 
struct space {
    word_t *firstWord;    /* start of space */
    int     size;         /* number of used words in space */
};


struct object {
    int     size,
            age;
    boolean isForwarded,
            isRemembered;
    union {
        struct object *contents[],
                      *forwardingPointer;
    };
};


struct space NewSpace, PastSurvivorSpace, FutureSurvivorSpace, OldSpace;

struct object *RememberedSetContents[MaxRemembered];
int RememberedSetSize;


/*
 * The main routine, generationScavenge, first scavenges the new
 * objects immediately reachable from old ones. Then it scavenges
 * those that are transitively reachable. If this results in
 * a promotion, the promotee gets remembered, and it first
 * scavenges objects adjacent to the promotee, then scavenges the
 * ones reachable from the promotee. This loop continues until
 * no more reachable objects are left. At that point,
 * PastSurvivorSpace is exchanged with FutureSurvivorSpace.
 *
 * Notice that each pointer in a live object is inspected once and
 * only once. The previousRememberedSetSize and
 * previousFutureSurvivorSpaceSize variables ensure that no object
 * is scanned twice, as well as detecting closure. If this were
 * not true, some pointers might get forwarded twice.
 */

generationScavenge()
{
    int previousRememberedSetSize = 0;
    int previousFutureSurvivorSpaceSize = 0;

    while (TRUE) {
        scavengeRememberedSetStartingAt(previousRememberedSetSize);
        if (previousFutureSurvivorSpaceSize == FutureSurvivorSpace.size)
            break;

        previousRememberedSetSize = RememberedSetSize;
        scavengeFutureSurvSpaceStartingAt(
            previousFutureSurvivorSpaceSize);
        if (previousRememberedSetSize == RememberedSetSize)
            break;

        previousFutureSurvivorSpaceSize = FutureSurvivorSpace.size;
    }

    exchange(PastSurvivorSpace, FutureSurvivorSpace);
}


/*
 * scavengeRememberedSetStartingAt(n) traverses objects in the remembered
 * set starting at the nth one. If the object does not refer to any new
 * objects, it is removed from the set. Otherwise, its new referents
 * are scavenged.
 */

scavengeRememberedSetStartingAt(dest)
int dest;
{
    int source;

    for (source = dest; source < RememberedSetSize; ++source)
        if (scavengeReferentsOf(RememberedSetContents[source])) {
            RememberedSetContents[dest++] =
                RememberedSetContents[source];
        }
        else
            resetRememberedFlag(RememberedSetContents[source]);
    RememberedSetSize = dest;
}


/*
 * scavengeFutureSurvSpaceStartingAt(n) does a breadth-first
 * traversal of the new objects starting at the one at the nth word
 * of FutureSurvivorSpace.
 */

scavengeFutureSurvSpaceStartingAt(n)
int n;
{
    struct object *currentObject;

    while (n < FutureSurvivorSpace.size) {
        scavengeReferentsOf(
            currentObject = FutureSurvivorSpace.firstWord[n]);
        n += sizeOfObject(currentObject);
    }
}


/*
 * scavengeReferentsOf(anObject) inspects all the pointers in anObject.
 * If any are new objects, it has them moved to FutureSurvivorSpace,
 * and returns truth. If there are no new referents, it returns falsity.
 * For simplicity here, an object is just an array of pointers.
 */

scavengeReferentsOf(anObject)
struct object *anObject;
{
    int i;
    boolean foundNewReferent;
    struct object *referent;

    foundNewReferent = FALSE;
    for (i = 0; i < anObject->size; i++) {
        referent = anObject->contents[i];
        if (isNew(referent)) {
            foundNewReferent = TRUE;
            if (!isForwarded(referent))
                copyAndForwardObject(referent);
            anObject->contents[i] = referent->forwardingPointer;
        }
    }

    return (foundNewReferent);
}

/*
 * copyAndForwardObject(obj) copies a new object either to
 * FutureSurvivorSpace, or if it is to be promoted, to OldSpace.
 * It leaves a forwarding pointer behind.
 */

copyAndForwardObject(oldLocation)
struct object *oldLocation;
{
    struct object *newLocation;

    if (oldLocation->age < MaxAge) {
        ++oldLocation->age;
        newLocation = copyObjectToSpace(oldLocation,
                                        FutureSurvivorSpace);
    }
    else
        newLocation = copyObjectToSpace(oldLocation, OldSpace);

    oldLocation->forwardingPointer = newLocation;
    oldLocation->isForwarded = TRUE;
}


How do old objects get reclaimed? An offline reclamation program traverses and
copies all objects in depth-first order to a file. This is a three-pass algorithm: The first pass
copies the live objects to a file and leaves forwarding pointers in the original objects. The
second pass traverses the file and updates the pointers. The third pass reads the file into
memory, overwriting the original area. Copying rearranges the objects into depth-first order,
which helps to reduce the number of page faults [Bla83b, Bla83d, Sta82, Sta84]. The whole
process takes a few minutes. If it is only required once or twice a day, it should not be too
disruptive.

5.8.3. Comparing Generation Scavenging to Other Scavenging Algorithms

Generation  Scavenging  most  resembles  Ballard's  scheme  [BaS83]: 

•  It  segregates  objects  into  young  and  old  generations. 

•  It  copies  live  objects  instead  of  sweeping  dead  objects. 

•  It reclaims old objects offline.

Generation Scavenging differs from Ballard's Semispaces and Lieberman-Hewitt’s
Generation Garbage Collection [LiH83]. Unlike those algorithms, Generation Scavenging

•  conserves main memory by dividing new space into three spaces instead of two.

•  is not incremental. Instead, the small pauses introduced by Generation Scavenging are
unnoticeable in normal interactive sessions. (They are noticeable in real-time
applications such as animation.) Incremental algorithms require checking on every load
instruction, and Generation Scavenging saves this time by not being incremental.

5.9. Performance Evaluation of Generation Scavenging

How  well  does  Generation  Scavenging  perform  in  Berkeley  Smalltalk  and  SOAR?  We 
concentrate  on  four  metrics: 

•  CPU time overhead, the CPU time spent reclaiming storage divided by the total CPU
time in the session,

•  pause time, the time that the user must wait for reclamation,

•  peak main memory usage, the amount of main memory that must be dedicated for
temporary objects, and

•  backing  store  accesses,  the  number  of  times  that  the  reclamation  algorithm  requires 
data  not  present  in  main  memory. 

5.9.1.  Evaluating  Generation  Scavenging  in  Berkeley  Smalltalk 

The  Smalltalk-80  macro  benchmarks  [McC83]  consist  of  representative  activities  like 
compiling  and  text  editing.  We  measured  the  performance  of  Generation  Scavenging  in  BS 
while  running  these  benchmarks.  Although  our  workstation  had  2  MB  of  main  memory, 
only  about  half  of  that  was  available  to  Berkeley  Smalltalk.  Table  5.7  shows  the  results. 

CPU Time Cost. Our measurements of BS show that Generation Scavenging requires
only 1.5% of the total (user CPU) time. This is four times better than its nearest competitor,
Ballard's modified semispaces, which takes about 7%.

One  reason  that  Generation  Scavenging  looks  so  good  is  that  BS  executes  programs 
more  slowly  than  some  other  Smalltalk-80  systems.  However,  the  next  section  shows  that 
Generation  Scavenging  performs  well  on  fast  Smalltalk-80  systems. 

Main Memory Consumption. Although each of the three new object areas occupies
140 KB of virtual memory (420 KB total), only 28 KB of each survivor area gets used. The
rest serves as a reserve against pathological survival and need not be resident. Thus, the
total primary memory cost for dynamic objects is 200 KB, about 10% of the BS main
memory. If we used Baker semispaces with the same scavenging rate, each space would
need to be 140 KB + 28 KB, for a total of 360 KB, almost twice as much as Generation
Scavenging.


Table 5.7: Performance of Generation Scavenging in BS

    total instructions executed        4500 k
    amount of storage reclaimed        3900 KB
    amount of tenured storage          9.1 KB
    number of checked stores           190 k
    number of remembered objects       320
    number of scavenges                32
    mean length of survivors           4.8 Kword

    total user CPU time                280 secs.
    total real time                    500 secs.
    real time scavenging               1.8%
    user time scavenging               1.5%
    time checking stores               0.1%

    max old space used                 940 KB
    max new space                      140 KB
    max survivor space                 28 KB
    total size                         1800 KB
    resident set size                  930 KB

    min pause time*                    90 ms
    median pause time*                 150 ms
    mean pause time*                   160 ms
    90th %ile pause time*              220 ms
    max pause time*                    330 ms
    mean time between scavenges        16 seconds

Backing Store Operations. Since new objects are always created in the same area,
they can remain in main memory. Unfortunately, Unix on the Sun 68010 workstation (Sun
Release 2.0) does not implement the system call that would lock down this area. Thus, the
first six scavenges caused 283 minor page faults (page reclaims), and the rest of the
scavenges caused four. With a working set of 930 KB, 60 major page faults occurred during
the benchmarks.

Pauses. Except for the page faulting during the first six scavenges (see above), the
pauses were small and mostly unobtrusive, averaging 150 ms. The longest pause was only
330 ms. About 15% of the pause time was spent in the Unix kernel on unrelated overhead.
Since people have difficulty noticing pauses of 100 ms, this algorithm's performance meets
our requirements.

5.9.2.  Evaluating Generation Scavenging on SOAR

The previous section shows that Generation Scavenging performs well in BS, requiring
only 1.5% of the CPU cycles. How well will this algorithm perform on SOAR?
SOAR will run Smalltalk programs ten times faster than BS. This will result in ten times
more garbage created in the same amount of time, but we would not expect Generation
Scavenging to run ten times faster on SOAR than in Berkeley Smalltalk. If it ran at the
same speed, then the overhead for scavenging on SOAR would be ten times worse, or 15%.
In fact, as we show in Section 5.9, Generation Scavenging takes only about 2% of SOAR's
time.

5.9.2.1.  SOAR Scavenge Duration

We have written Generation Scavenging in SOAR assembly language and simulated it
in the course of running the macro benchmarks. Table 5.8 gives measurements of 12
scavenges: nine from the decompiler benchmark, two from the printDefinition benchmark, and
one from the compiler benchmark. (See Chapter 4.1 for a description of the benchmarks.)
As expected, the duration of a scavenge can be predicted from the number of words of new
objects that survive the scavenge. Figure 5.8 superimposes the observed data with a linear
regression. The regression predicts that the number of cycles for a scavenge is
24 × surviving-words + 3500, with a correlation coefficient r of 0.976.
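The regression quoted above can be turned into a small pause-time estimator. The following is only a sketch: the slope, intercept, and 400 ns cycle time come from the text, while the function and constant names are illustrative assumptions.

```python
# Sketch of the linear scavenge-cost model quoted in the text.
# Names are hypothetical; only the three constants come from the source.
CYCLES_PER_SURVIVING_WORD = 24   # regression slope
FIXED_CYCLES = 3500              # regression intercept
CYCLE_TIME_NS = 400              # cycle time assumed in Table 5.8


def predicted_scavenge_cycles(surviving_words):
    """Cycles predicted for a scavenge that copies `surviving_words` words."""
    return CYCLES_PER_SURVIVING_WORD * surviving_words + FIXED_CYCLES


def predicted_pause_ms(surviving_words):
    """Pause time in milliseconds implied by the predicted cycle count."""
    return predicted_scavenge_cycles(surviving_words) * CYCLE_TIME_NS / 1e6
```

The model makes the design point explicit: the pause depends on how much data survives, not on how much garbage was created.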

The last column of Table 5.8 gives the duration, or pause time, of each scavenge,
assuming 400 ns per cycle. Despite identical cycle times, SOAR's mean scavenge time was


Table 5.8: Statistics on twelve scavenges simulated for SOAR.
The last column assumes a cycle time of 400 ns.

            name of            scavenge     data         cycles    scavenge
            benchmark          time         scavenged    per       time
                               (cycles)     (words)      word      (ms)

    1       decompiler                                   23        23
    2       decompiler         45,832       2,028        23        19
    3       decompiler         45,491       2,022        22        18
    4       decompiler         41,262       1,828        23        17
    5       decompiler         69,937       3,114        22        27
    6       decompiler         37,449       1,692        22        15
    7       decompiler         37,157       1,693        23        16
    8       decompiler         30,100       1,489        20        12
    9       decompiler         29,228       1,489        20        12
    10      printDefinition    63,417       2,542        25        25
    11      printDefinition    53,535       2,511        21        22
    12      compiler           60,374       2,834        21        24

    min                                     1,500        20        12
    25%ile                     37,000       1,700        21        15
    median                     45,000       2,000        22        18
    mean                       48,000       2,100        22        19
    (s.d.)                     (13,000)     (540)        (1.4)     (5.0)
    75%ile                     57,000       2,500        23        23
    max                        70,000       3,100        25        27

scavenge to copy them. On the other hand, SOAR allocates activation records in a
separate stack that gets scanned rather than copied. The numbers show that the average
BS scavenge copied 4.8 Kwords whereas the average SOAR scavenge copied only
2.1 Kwords. This accounts for 2.3 times the work.

The above two explanations together account for a factor of 4.6, leaving a factor of 1.8
performance improvement to be explained by the next two differences (which are harder to
quantify):

•  Assembly code can be more efficient than C. Generation Scavenging is written in
assembler for SOAR and in C for BS.

•  SOAR's architecture runs programs faster than the 68010's. In particular, the reduced
instruction set, register file, word addressing, fast shuffle, and tag checking hardware
might contribute to the performance improvement of scavenging in SOAR.

5.9.2.2.  SOAR Scavenge Frequency

The worst SOAR scavenge took 27 ms, which is well below the threshold for an
annoying pause. However, if the time that a program could run between one scavenge and the
next were too short, the 27 ms pause would still be unacceptable. The length of this gap
between pauses is determined by the creation rate for new objects and by the amount of
memory available to hold them. To measure this interval we ran six benchmarks on SOAR
and measured the rate of object creation during a (randomly chosen) portion of each. The
data are presented in Table 5.9. With 150 KB available for newly-created objects, 2.3
seconds of computation will be available to amortize the 27 ms scavenging pause. The
creation rate would have to grow by an order of magnitude to be a problem.


Table 5.9: Space allocation rate benchmarks on SOAR.
(Samples are complete second iterations of each benchmark.)
(Assumes new area size = 150 KB, cycle time = 400 ns.)

    benchmark          duration     space        growth    growth      scavenge
                                    allocated    rate      rate        interval
                       (cycles)     (words)      (w/kc)    (kw/sec)    (secs)

    decompiler         2,958,219    36,886       12        31          1.2
    printHierarchy     119,040      1,426        12        30          1.3
    allImplementors    2,257,051    18,058       8.0       20          1.9
    printDefinition    75,319       509          6.8       17          2.3
    compiler           1,117,660    7,467        6.7       17          2.3
    classOrganizer     2,959,728    9,905        3.3       8.4         4.6

    mean               —            —            8.1       21          2.3
    s.d.               —            —            3.4       8.6         1.2
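The scavenge interval column of Table 5.9 can be reproduced from the first two data columns. The sketch below assumes 4-byte words and 1 KB = 1024 bytes (neither is stated in this section, though both are consistent with the table); the function name and parameters are illustrative.

```python
# Sketch: time to fill the new area at a measured allocation rate.
BYTES_PER_WORD = 4  # assumption: 32-bit SOAR words


def scavenge_interval_secs(words_allocated, cycles,
                           new_area_bytes=150 * 1024, cycle_time_ns=400):
    """Seconds of computation between scavenges for one benchmark sample."""
    sample_secs = cycles * cycle_time_ns * 1e-9          # duration of the sample
    words_per_sec = words_allocated / sample_secs        # allocation (growth) rate
    new_area_words = new_area_bytes / BYTES_PER_WORD     # capacity of the new area
    return new_area_words / words_per_sec
```

For example, `scavenge_interval_secs(36886, 2958219)` gives the decompiler row and `scavenge_interval_secs(9905, 2959728)` gives the classOrganizer row.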

5.9.2.3.  Net SOAR Scavenge Overhead

Given the above data, we can calculate the pause time, gap between scavenges, and
average scavenge overhead (Table 5.10). The results show that Generation Scavenging is
non-disruptive; a 27 ms pause every second is hard to notice. Furthermore, scavenging uses
less than 2% of the CPU time, allowing the computation to proceed at full speed.
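The overhead figures follow from dividing a pause time by the interval between scavenges. A minimal sketch, assuming overhead is defined as pause divided by interval (the text does not spell out the exact formula) and using a hypothetical function name:

```python
# Sketch: scavenge overhead as pause time divided by the gap between scavenges.
def scavenge_overhead_pct(pause_ms, interval_secs):
    """Percentage of time spent scavenging, given one pause per interval."""
    return 100.0 * (pause_ms / 1000.0) / interval_secs


# Pairings taken from the measurements above: shortest pause with longest
# interval (best case), mean with mean (average), and longest pause with
# shortest interval (worst case).
best = scavenge_overhead_pct(12, 4.6)
average = scavenge_overhead_pct(19, 2.3)
worst = scavenge_overhead_pct(27, 1.2)
```

Under these assumptions the three values come out at roughly a quarter of a percent, just under one percent, and a little over two percent.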

5.9.2.4.  Generation Scavenge Trap Time

Recall that the Generation Scavenging algorithm maintains a table of references from
old to new objects. SOAR traps when it creates such a reference, enabling the trap routine to
enter the address of the referenced object in the table. Table 5.11 gives an analysis of store
trap overhead for the simulated macro benchmarks. The path length of 100 cycles for a store
trap was determined by assuming a 1 in 8 chance of window overflow, and taking the worst
case for the other branches. The worst case overhead to maintain the remembered set is 1%,
with a median of 0.05%.


Table 5.10

                          best case    average     worst case

    pause time            12 ms        19 ms       27 ms
    scavenge interval     4.6 secs     2.3 secs    1.2 secs
    scavenge overhead     0.3%         0.8%        2.3%
    trapping overhead     0%           0.05%       1.0%
    total overhead        0.3%         0.9%        3.3%
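The store trap overhead percentages in Table 5.11 are the product of the trap count and the 100-cycle path length, divided by the benchmark's cycle count. A brief sketch with hypothetical names:

```python
# Sketch: store-trap overhead as a percentage of benchmark cycles.
STORE_TRAP_CYCLES = 100  # path length per store trap, as assumed in the text


def store_trap_overhead_pct(store_traps, benchmark_cycles):
    """Percentage of benchmark cycles spent in the store-trap handler."""
    return 100.0 * store_traps * STORE_TRAP_CYCLES / benchmark_cycles
```

For instance, the printHierarchy sample (12 traps in 119,040 cycles) yields about 1%, the worst case quoted above.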

5.9.3.  Summary of Generation Scavenging's Performance
Table 5.12 summarizes our findings. See Appendix D for a more detailed description.
Generation Scavenging offers outstanding performance:

•  At 3%, its CPU overhead is three times lower than deferred reference counting, its
nearest competitor on a compiled Smalltalk-80 system. The overhead is so low that
designers of high-performance systems who formerly shunned automatic storage
reclamation can now embrace it.


Table 5.11: Generation Scavenge store trapping overhead in SOAR.

    Benchmark          Benchmark    store    store trap    store trap
    Name               Cycles       traps    cycles        overhead

    decompiler         2,958,219    0        0             0%
    allImplementors    2,257,051    1        100           0.004%
    classOrganizer     2,959,728    14       1,400         0.05%
    compiler           1,117,660    7        700           0.06%
    printDefinition    75,319       1        100           0.13%
    printHierarchy     119,040      12       1,200         1.0%

    median                                                 0.05%


Table 5.12: Summary of Generation Scavenging's Performance.

                                      Berkeley Smalltalk    SOAR

    execution model                   interpreted           compiled
    source of data                    measurements          simulations
    processor                         MC68010               SOAR
    cycle time                        400 ns                400 ns
    CPU time overhead                 1.5%                  0.9%
      worst case                      n.a.                  3.3%
    pause time (scavenge duration)    160 ms                19 ms
      worst case                      330 ms                28 ms
    peak main memory usage            200 KB                200 KB
    backing store accesses            0.15                  n.a.

•  The short pause times for Generation Scavenging are a good match to an exploratory
programming environment. Since people have difficulty noticing pauses of 100 ms,
they will not be disturbed by pauses of 28 ms.

•  The 200 KB of main memory needed for temporary data exceeds the space
requirements of most older algorithms. However, given the state of the art in computer
memory hardware, 200 KB of overhead seems reasonable for a system with 2 MB of
main memory.

•  Ideally, automatic storage reclamation should not cause any page faults. Even without
any provisions for locking new and remembered objects in main memory, BS averaged
only 1 page fault per seven scavenges.

5.9.4.  Performance  Evaluation  of  Direct  Addressing  on  SOAR 

Because Generation Scavenging includes compaction, the usual indirection through an
object table is unnecessary in BS and SOAR, making them the only Smalltalk-80 systems
without object tables. The indirection through such a table is sometimes overlooked when
evaluating reference-counting reclamation, but it can be a bottleneck; a typical Smalltalk-80
system accesses the object table 1.2 times per bytecode [UnP83]. Assuming SOAR
performs as fast as the Dorado (300K bytecodes/s), SOAR would access the object table 360,000
times per second. The absolute minimum table access would be a single load instruction.
Assuming 400 ns per cycle, such an indirection would take two cycles, or 800 ns. At
360,000 table accesses per second, that would be 0.29 seconds of indirection time for each
second of processing time. Discussions with Deutsch suggest that further optimization
possibly could halve this overhead. In other words, an object table would slow SOAR by 15%
to 29%.
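The object table estimate is a straightforward product of rates. The sketch below restates the arithmetic; the default values are read from the paragraph above (a rate of 360,000 table accesses per second at two 400 ns cycles each), and the function name is an illustrative assumption.

```python
# Sketch: seconds of object-table indirection per second of processing.
def object_table_overhead(bytecodes_per_sec=300_000,
                          accesses_per_bytecode=1.2,
                          cycles_per_access=2,
                          cycle_time_ns=400):
    """Fraction of each second spent on object-table loads."""
    accesses_per_sec = bytecodes_per_sec * accesses_per_bytecode
    return accesses_per_sec * cycles_per_access * cycle_time_ns * 1e-9


upper = object_table_overhead()                      # two cycles per access
lower = object_table_overhead(cycles_per_access=1)   # if optimization halves the cost
```

Halving the per-access cost halves the result, which is how the lower end of the quoted range arises.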

Although we eliminated the object table to improve performance, there is one
Smalltalk-80 primitive operation that runs much slower without it. The become: primitive


[Figure 5.10 panels: original; create new array; switch internal pointer]

Figure 5.10: Growing without become:. The sequence above illustrates how our modified
sets grow without resorting to become:. The contents are stored in a separate array. To
grow, the set allocates a larger array, initializes it, and redirects an internal pointer to the
new array. We have replaced costly implicit indirection with explicit indirection that incurs
cost only when needed. This is in keeping with the RISC philosophy.


Table 5.13: Performance impact of eliminating becomes.

    benchmark          # becomes    duration       duration        cycles
                                    w/ becomes     w/o becomes     saved

    printDefinition                 75,475         75,317
                                    1,383,201      1,127,658
                                    4,045,641      3,006,974
                                    165,997        119,574


any  becomes.  But,  our  efforts  to  eliminate  becomes  from  programs  that  did  use  them  were 
handsomely  repaid  with  an  18%  to  28%  performance  improvement. 


Although we have eliminated becomes invoked by the system classes, the SOAR
programmer must either shy away from this primitive, or be prepared to pay a stiff performance
penalty. Forcing the user to worry about the efficiency of a primitive operation runs counter to
the philosophy of exploratory programming environments in general and Smalltalk-80 in
particular. However, we believe that the become primitive is so intrinsically
expensive (fast becomes require a level of indirection that slows down many frequent
operations) that the effort to accomplish a become should not be hidden.

We have also estimated the impact of indirection on code size. An object table would
require an extra instruction to load or store a literal variable, and one indirection in the
method prologue (for the receiver). (We are assuming that many indirections will be
optimized away, as in Deutsch and Schiffman's system.) Table 5.14 presents our analysis under
these assumptions. The extra code for an object table would add only 2% to the size of the
system.
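The relative cost in Table 5.14 can be checked by counting one extra instruction per method prologue, literal variable load, and literal variable store. This sketch assumes 4-byte instructions and 1 kB = 1000 bytes, neither of which is stated here (both reproduce the table's 2.25%); the names are illustrative.

```python
# Sketch: static code growth from one extra instruction per indirection site.
BYTES_PER_INSTRUCTION = 4  # assumption: fixed-size 32-bit instructions


def extra_code_fraction(prologues, literal_loads, literal_stores, image_bytes):
    """Fraction by which the image grows if each site gains one instruction."""
    extra_instructions = prologues + literal_loads + literal_stores
    return extra_instructions * BYTES_PER_INSTRUCTION / image_bytes


# Counts from Table 5.14, with a 1,500 kB image.
fraction = extra_code_fraction(4654, 3532, 254, 1_500_000)
```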

5.9.5.  Architectural  support  for  Storage  Management 

The SOAR chip supports demand-paged virtual memory with restartable, fixed-size
instructions and a page fault interrupt [SKF85]. An off-chip page map translates addresses
and maintains referenced information. The silicon cost for virtual memory is about 20
support chips including the page map. Figure 5.11 shows that the SOAR host board hides the
page map access time in memory access time [B1D83].

To support Generation Scavenging, all pointers include a four-bit tag. When a store
instruction stores a new pointer into an old object, a special trap occurs. The software trap
handler then records the reference. The tag-checking PLA has 8 inputs and one output, and
occupies about 0.1% of the total chip area. The cost of the extra control logic to handle the
trap is harder to measure. As mentioned in Chapter 4, tagged store instructions occur so
rarely that even this small cost cannot be justified.


Table 5.14: Static cost of object indirection.

    method prologues                    4654
    literal variable loads              3532
    literal variable stores             254
    total image size                    1,500 kB
    relative cost of additional code    2.25%

[Figure 5.11 diagram labels: virtual page #, offset into page, page map, physical page #;
sequence: page offset to RAM, access page map, page # to RAM]

Figure 5.11: Fast address translation. The SOAR system has adopted the same technique
as the Sun 68010 workstation to perform address translation without hurting performance.
It hides the translation time in the address multiplexing delay for the dynamic RAM chips.
On each memory access, the low order address bits that specify the offset into the page are
sent to the memory while simultaneously reading the page map. The physical page number
is then sent to the memory as the second piece of the address. A virtual memory with one
segment per object could not run as fast because the offset into a segment is not identical to
the least significant bits of the physical address. Consequently, no portion of the virtual
address can be sent immediately to the RAM chips.


5.9.6.  Generation  Scavenging  and  Activation  Records 

We have simplified this chapter by deliberately omitting activation records. In this
section, we outline the problems caused by activation records in Smalltalk-80 and our
solutions to them. Activation records present a problem because a Smalltalk-80 program can
manipulate them like any other object. For instance, a subroutine can obtain a pointer to its
activation record and place it in a global variable. After the subroutine returns, another
routine can inspect the activation record via the global variable. Since SOAR activation records
are kept in the register frame stack, extraordinary measures are required to preserve this
information. When a Smalltalk-80 program creates a reference to an activation record we
mark it as non-lifo. When a non-lifo activation is about to be destroyed (i.e. when a return
instruction attempts to free it), we copy the record to the heap and adjust the references to it.


Thus,  the  steps  are: 

1)  Detect the creation of a non-lifo reference to an activation record, then mark the
activation record as non-lifo:

A non-lifo reference can be created by storing a pointer to an activation record or by
returning such a pointer as a result. We have allocated a distinct tag for activation
records (context, or 1111). A tagged store instruction will trap when storing such a
pointer. As for returns, the SOAR compiler generates a trap instruction before each
return that checks the tag and traps if needed. The trap handler sets the high-order bit
of the activation record's return address. This marks the activation record as non-lifo.
Meanwhile, the reference is added to a software table so it can be updated later.

2)  Detect a return from a non-lifo activation record, then copy it and update any
references to it.

The return instruction traps if the return address has its high-order bit set. This trap
handler then allocates space in the new area for the activation record, copies it, and
updates references to it. At this point there is no need to trap further stores, so the
reference's tag is changed to new.
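The two steps above can be modeled in a few lines. This is a loose software sketch, not SOAR's trap handlers: the class and function names are hypothetical, a boolean flag stands in for the high-order bit of the return address, and a Python list stands in for the software table of references.

```python
# Sketch of non-lifo activation record handling (hypothetical names throughout).
class Activation:
    """An activation record living on the stack."""
    def __init__(self, name):
        self.name = name
        self.non_lifo = False   # stands in for the high-order bit of the return address
        self.referrers = []     # stands in for the software table of references


class HeapContext:
    """Heap copy of an activation record that outlived its stack frame."""
    def __init__(self, activation):
        self.name = activation.name


class Slot:
    """Any location that can hold a pointer, e.g. a global variable."""
    def __init__(self):
        self.value = None


def tagged_store(slot, value):
    """Step 1: storing a pointer to an activation record marks it non-lifo."""
    if isinstance(value, Activation):      # models the trap on the context tag
        value.non_lifo = True
        value.referrers.append(slot)       # remember the reference for later update
    slot.value = value


def do_return(stack, new_area):
    """Step 2: returning from a non-lifo record copies it and fixes references."""
    activation = stack.pop()
    if activation.non_lifo:                # models the trap on the marked return address
        copy = HeapContext(activation)
        new_area.append(copy)              # allocate the copy in the new area
        for slot in activation.referrers:
            slot.value = copy              # redirect references to the heap copy
    return activation
```

An ordinary activation that is never stored anywhere takes the fast path in `do_return` and never touches the heap, which is the case the design tries to keep cheap.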

We have extended this strategy to include blocks. Smalltalk-80 blocks implement
control structures by allowing one routine to control execution in another's context.
Frequently, a block is created, passed down the call chain to a subroutine that repeatedly
invokes the block, and then returns. Thus, we must impose a minimum of overhead on this
case, while handling non-lifo references to blocks. In other words, although a block is an
object that refers to a context, we do not mark the context as non-lifo until the block itself
becomes non-lifo. This is accomplished with the same mechanism outlined above, using the
context tag for block objects.

5.9.7.  The Potential Problem of Premature Promotion

Recall that Generation Scavenging is based on the assumption that the longer an object
survives the longer it will remain alive. Therefore, when an object attains a ripe old age, it is
promoted from the new generation to the old. At this point, the system assumes that the
object is immortal and ceases attempts to reclaim it. For this reason, we call the promotion
process tenuring. However, in some cases the object may die shortly thereafter and waste
space long after its useful life.

At  first  glance,  one  would  expect  dead  tenured  objects  to  waste  backing  storage,  but 
not main memory. They would seem to get paged out to make room for tenured objects that
remain  alive.  However,  because  an  object  is  so  small  relative  to  the  size  of  a  page  (14  vs. 
1024  words),  a  page  could  easily  contain  just  a  few  live  objects  among  many  dead  ones. 
This  internal  fragmentation  could  tie  up  much  more  main  memory  than  is  actually  needed 
for  the  live  objects.  In  this  manner  dead  tenured  objects  can  increase  the  number  of  pages  in 
the  working  set. 

How  severe  is  this  problem?  We  plan  to  reclaim  dead  tenured  objects  once  a  day  by  an 
offline  reclamation  program.  How  many  will  build  up  in  a  day?  We  won’t  know  until  we 
measure  the  lifetimes  of  objects  over  hours  of  elapsed  time  on  a  high-performance  system 
like the Dorado or SOAR. Chapter 6 has a more detailed discussion of this issue and
strategies for coping, should it turn out to be a problem.


5.10.  Summary  of  Reclamation  Algorithms 


Table 5.15 summarizes our results: both Deutsch-Bobrow deferred reference counting
and  Generation  Scavenging  perform  well  enough  for  an  advanced  personal  computer.  The 
advantages  of  Generation  Scavenging  over  deferred  reference  counting  are: 

•  it reclaims circular structures,


•  it  includes  compaction,  and 


•  it  uses  less  than  a  tenth  of  the  total  CPU  time. 


5.11.  Conclusions 


The  combination  of  generation  scavenging  and  paging  provides  high  performance 
automatic  storage  reclamation,  compaction,  and  virtual  memory.  This  method  of  storage 
management  has  proven  its  worth  daily  in  Berkeley  Smalltalk,  which  has  supported  the 
SOAR  compiler  project,  architectural  studies,  and  text  editing  for  portions  of  this  chapter. 

The algorithm we have presented may not accommodate objects that live for a medium
amount of time; they may increase the time overhead or cause thrashing. Measurements
must be taken on high-performance Smalltalk-80 systems to understand the behavior of
these objects.


Table  5.15:  Summary  of  reclamation  strategies. 


                              main memory    paging    pause    pause
                              for dynamic    I/Os      time     interval


no reclamation


immed. ref. count
(compaction)
deferred ref. count
(compaction)


mark and sweep
Ballard


Generation Scavenging
BS

SOAR best case
SOAR average
SOAR worst case


15% - 20%


25% - 40%
7%*


1900 KB
2000 KB


200 KB
170 KB
170 KB
170 KB


0.025  1.1

*  Ballard's Smalltalk-80 system used interpretive execution. Although running on a VAX 11/780, it ran the compiler
macro-benchmark five times slower than Deutsch's deferred reference counting, dynamically compiled Xerox ST68K system
[BaS83, DeS84]. Ballard's storage reclamation algorithm may well exceed 7% overhead on a compiled Smalltalk-80 system.

High performance storage reclamation relies on two principles:

•  Young  objects  die  young.  Therefore  a  reclamation  algorithm  should  not  waste  time  on 
old  objects. 

•  For  young  objects,  fatalities  overwhelm  survivors.  Copying  survivors  is  much  cheaper 
than  scanning  corpses. 

Careful consideration of the virtual memory system is essential. Generation Scavenging
combines these lessons to meet stringent performance goals: low time overhead (2% in BS,
3% in SOAR), imperceptibly short pause times (160 ms in BS, 27 ms in SOAR), and a low
page fault rate (1.2 faults/sec in BS). Meeting these goals costs 200 KB of primary memory,
but the result is worth it: a high-performance computer system with fast automatic storage
reclamation.


Chapter  6 


Scavenging  Data  with  Intermediate  Lifetimes 

6.1.  Introduction

What happens if the age of an object fails to predict its lifetime? An object that
survives long enough to be promoted but succumbs shortly thereafter will waste storage in old
space. This chapter contains a detailed description of the problem, how we have attacked it in
Berkeley Smalltalk, some proposals for extra generations, and an analytical model that sheds
some light on the effect of various parameters on performance.

6.2.  The Tenuring Threshold

When  should  Generation  Scavenging  tenure  an  object?  Since  we  have  observed  that 
young  objects  are  likely  to  die  and  old  ones  are  likely  to  persist,  our  algorithm  tenures  an 
object  that  lives  long  enough.  The  easiest  way  to  measure  age  is  to  count  the  number  of 
scavenges  an  object  survives.  Thus,  each  object  contains  a  byte  that  is  initialized  to  zero  and 
is  incremented  on  each  scavenge.  If  an  object  survives  for  a  certain  number  of  scavenges,  it 
gets  tenured.  The  problem  is  to  choose  this  threshold.  If  it  is  too  small,  that  is  if  Generation 
Scavenging  tenures  objects  too  soon,  a  large  fraction  of  them  will  die  shortly  after  receiving 
tenure.  Tenured  garbage  wastes  space  on  backing  store,  and  more  importantly,  may  slow  the 
system  with  extra  page  faults  by  mixing  dead  and  live  objects  on  the  same  page.  On  the 
other  hand,  if  the  tenuring  threshold  is  too  high,  long-lived  objects  will  pile  up  in  the  new 
area,  increasing  the  amount  of  data  that  must  be  copied  for  each  scavenge.  This  will 
increase  the  pause  time  and  the  CPU  overhead  for  storage  reclamation.  Thus,  the  tenuring 
threshold  must  balance  the  increase  in  page  faults  caused  by  tenured  garbage  against  the 
extra  pause  time  caused  by  scavenging  long-lived  objects. 
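The age-counting scheme described above can be sketched as follows. This is an illustrative model, not the BS implementation: the names are hypothetical, and the exact comparison (tenure once the age counter has reached the threshold, so that a threshold of zero tenures every survivor) is an assumption chosen to match the behavior described in this chapter.

```python
# Sketch: per-object age counter and a tenuring threshold (hypothetical names).
from dataclasses import dataclass


@dataclass
class Obj:
    name: str
    age: int = 0   # number of scavenges survived so far


def scavenge(new_objects, is_live, tenure_threshold):
    """Return (survivors, tenured) after one scavenge of the new generation."""
    survivors, tenured = [], []
    for obj in new_objects:
        if not is_live(obj):
            continue                     # dead objects are simply never copied
        if obj.age >= tenure_threshold:
            tenured.append(obj)          # old enough: promote to old space
        else:
            obj.age += 1                 # survived one more scavenge
            survivors.append(obj)        # copy into the survivor area
    return survivors, tenured
```

A higher threshold keeps long-lived objects in `survivors` for more scavenges, which is exactly the extra copying cost the text weighs against tenured garbage.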


In Berkeley Smalltalk, we have included a feedback-mediated adaptive algorithm to
set the tenuring threshold. The algorithm examines the amount of data that survived the
previous scavenge and adjusts the tenuring threshold accordingly. The current implementation
limits the tenuring threshold to 64, where it remains most of the time. On SOAR, a tenuring
threshold of 64 would mean that an object would have to survive for more than a minute to
be tenured. Since the response time for most requests is much smaller than a minute, setting
the tenuring threshold to 64 would allow Generation Scavenging to reclaim the bulk of the
garbage online.

We have performed an experiment with BS to better understand tenuring. Since the
objects of concern are those that live for relatively long times, a typical interactive session of
several hours' duration would be ideal for characterizing tenuring behavior. Berkeley
Smalltalk's poor overall performance, 10% of a Dorado, prevented us from gathering data
from a typical interactive session. Lacking a Dorado or SOAR chip, we settled for a
synthetic workload: our image merely ran the decompiler benchmark twenty times. The
interval between scavenges was held fairly constant while varying the tenure threshold. A total
of 20 kw was allocated in the new area (plus 20 kw for each survivor area). The
feedback-mediated scavenge algorithm used an average of 18.7 kw before each scavenge. Table 6.1
gives our results.

Figure 6.1 shows the relationship between the tenuring threshold and the number of
bytes of data that were tenured. As expected, the number of objects achieving tenure
decreases as the time required to obtain tenure increases. In addition, there are two knees in
the curve, also just as expected. The first knee, at a tenure threshold of one, merely
proves that most objects die very quickly. The reason is that a threshold of zero means that
every object gets promoted, even though it may be only milliseconds old, but a threshold
of one means that an object that gets promoted must be older than the time between
scavenges. Since the scavenges occurred every 3.5 seconds, this knee shows that many
objects live less than 3.5 seconds.


Table 6.1: Results of BS tenuring experiment

            83    310    16.9    3.0    4.3    0.8%
     3      83    300    16.7    3.2    4.3    0.9%
     4      83    290     3.7    3.4    4.8    0.9%
     5      83    300     3.7    3.4    4.6    0.9%
     6      83    300     3.9    3.5    4.6    0.9%
     7      83    280     3.7    3.5    4.7    1.0%
     8      83    290     3.6    3.6    4.8    1.0%
    16      83    290     2.9    3.8    4.9    1.0%
    32      83    300     2.4    4.2    6.9    1.1%
    64      83    290     2.0    5.1    6.4    1.4%


The second knee, at 4, indicates that many objects live for more than 3 × 3.5 seconds but
less than 4 × 3.5 seconds. This is not surprising: because each iteration of the benchmark took
about 12 seconds, the only objects tenured at a threshold of 4 were those that survived for
more than one iteration. These were the text lines printed on the screen from the benchmarks.
This experiment confirms our understanding of tenuring: any object that outlives
the product of the tenuring threshold and the inter-scavenge time gets tenured.
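
The tenuring rule stated above can be sketched in a few lines. This is an idealized illustration only: the function name `is_tenured` is hypothetical, and the sketch ignores where in the inter-scavenge interval an object happens to be born. It uses the 3.5 second scavenge interval and the roughly 12 second benchmark iteration quoted in the text.

```python
def is_tenured(lifetime_s: float, tenure_threshold: int,
               scavenge_interval_s: float) -> bool:
    """Idealized rule: an object is tenured once its lifetime exceeds
    tenure_threshold * scavenge_interval_s."""
    return lifetime_s > tenure_threshold * scavenge_interval_s


SCAVENGE_INTERVAL_S = 3.5   # time between scavenges in the experiment
ITERATION_S = 12.0          # approximate length of one benchmark iteration

for threshold in range(1, 6):
    tenured = is_tenured(ITERATION_S, threshold, SCAVENGE_INTERVAL_S)
    print(f"threshold {threshold}: iteration-length object tenured = {tenured}")
```

Under this rule an object that lives for one benchmark iteration is tenured at thresholds up to 3 (3 × 3.5 = 10.5 seconds) but not at 4 (14 seconds), which is where the second knee appears.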

Although minimizing the amount of tenured data saves (virtual) memory space and
improves paging performance, it forces the scavenge operation to copy more survivors,
which takes more time. The surprise is how small this increase is. In this experiment, the
quantity of tenured data (which is principally garbage) decreased by a factor of 23, while
the time spent on scavenging merely doubled.

Unfortunately, we would need measurements of a fast Smalltalk-80 system to completely
predict the effects of tenuring. Tenuring affects objects that live for minutes or
hours. These objects are used by people, not programs. For example, the objects that

* Based on 24 cycles × survivor + 3500, as derived in Section 5.9.2.1.


1. Two generations with fast tenuring. This is the present configuration. Deutsch has
estimated that data structures used by a typical window, for example a browser, consume
15 KB of memory. At 20 cycles per word, that means that it would take 30 ms to
scavenge the data for a window. Thus, assuming 150 KB of new space, every
untenured window would add 3% to the scavenging overhead, limiting the number of
untenured windows to about 4. If the rate of window creation is slow enough, a system
that tenures objects so fast that every window gets tenured may be practical. On the
other hand, if many windows are created and immediately destroyed (as in the case of
error message windows) it may be important to retain a few untenured windows.

2. Two generations with slow tenuring. Assume we dedicate a megabyte of physical
memory to new objects. Then the system can run seven seconds between scavenges.
That means that more data can be scavenged without incurring excessive
overhead. In fact, the limit becomes the scavenge's pause time, not the percentage of
overhead. Suppose that we accept a fifth-second pause every seven seconds. That is
long enough to scavenge seven windows. This may be a sufficient number of
untenured windows to avoid tenuring garbage. (Interestingly, seven is roughly the size
of a human short-term memory.)

3. Three generations with fast tenuring. Suppose we add a third generation in the middle.
Some of the space for the third generation can be obtained by reducing the size of the
youngest generation from 100KB to 50KB, which triples the scavenge overhead to a
(still acceptable) 3%. A middle generation of 300KB of physical memory can contain
ten untenured windows (in each semispace). The time for a scavenge of the middle
generation would be about 300 ms. This option can support about the same number of
windows as the two-generation, slow-tenuring one, but with slightly more space and
significantly less time overhead.


4. Three generations with slow tenuring. Suppose we add a large third generation, but
use virtual memory instead of physical. Scavenging this middle-aged generation
would then incur page faults and cause a perceptible pause, perhaps one to three
seconds. However, 30 windows could be created before filling (the 1/2 MB semispace
of) a one megabyte generation. Thus, these long scavenges would be infrequent, and
acceptable.

5. Four generations. SOAR's tags support four generations, so we could combine the
above schemes. The youngest generation would be small, locked into memory, and
frequently scavenged. An object surviving two scavenges would be promoted into the
next generation. This would also be in physical memory, but larger. This generation
would hold the newest few windows. Thus, this is important if many windows are
closed immediately. The third generation would be about a megabyte, and located in
virtual memory. Most windows and medium-lifetime objects would reside here. They
could be reclaimed without a complete reorganization. Finally, permanent objects like
the square-root routine would reside in the oldest generation, which would be
reclaimed and reorganized offline. Table 6.2 summarizes these proposals. More work
is needed to measure the behavior of these medium-lifetime objects and to design
appropriate two- or three-generation parameters and reorganization algorithms.
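
The window arithmetic behind these proposals can be checked with a short sketch. The function names are hypothetical; the constants (15 KB per window, 20 cycles per word, a 400 ns cycle) are the figures quoted in the proposals, and the 4 bytes per word follows from the text's conversion of 17,000 words per second into 68,000 bytes per second.

```python
BYTES_PER_WORD = 4        # 32-bit words
CYCLES_PER_WORD = 20      # scavenge cost quoted in proposal 1
CYCLE_TIME_S = 400e-9     # 400 ns SOAR cycle
WINDOW_BYTES = 15_000     # Deutsch's estimate for one window


def window_scavenge_time_s(window_bytes: int = WINDOW_BYTES) -> float:
    """Time to scavenge the data belonging to one untenured window."""
    words = window_bytes / BYTES_PER_WORD
    return words * CYCLES_PER_WORD * CYCLE_TIME_S


def windows_per_pause(pause_s: float) -> float:
    """How many windows can be scavenged within a given pause."""
    return pause_s / window_scavenge_time_s()


def windows_per_semispace(semispace_bytes: int) -> int:
    """How many whole windows fit in one semispace."""
    return semispace_bytes // WINDOW_BYTES


print(f"one window: {window_scavenge_time_s() * 1e3:.0f} ms")
print(f"0.2 s pause: {windows_per_pause(0.2):.1f} windows")
print(f"150 KB semispace: {windows_per_semispace(150_000)} windows")
print(f"500 KB semispace: {windows_per_semispace(500_000)} windows")
```

The sketch gives 30 ms per window, between six and seven windows per fifth-second pause (the text rounds to seven), ten windows in a 150 KB semispace, and about thirty in a half-megabyte semispace.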


6.3. Analysis of a Single Scavenged Generation

How much physical memory must be dedicated to new objects? In this section we
present an analysis of a two-generation system where one generation is scavenged (New)
and the other is reclaimed offline (Old). Since the Old objects are reclaimed offline, we will
only analyze the New generation here. Table 6.3 introduces the relevant terms. The first
constraint we face is to keep the scavenge pauses small enough to be unobtrusive. The data
on scavenging duration in the previous section showed that the length of a scavenge can be
Table 6.2: Summary of tenuring proposals.

generation                assistant   associate   full       emeritus
type of memory            physical    physical    virtual    virtual

Proposal 1. Two generations, fast tenuring.
creation area (KB)        140         -           -          4,000
gap time (sec)            1           -           -          ?
survivor area (KB)        17          -           -          disk
pause time (ms)           30          -           -          60
scavenge time (%)         3%          -           -          ?
primary memory (KB)       170         -           -          2,000

Proposal 2. Two generations, slow tenuring.
creation area (KB)        420         -           -          4,000
gap time (sec)            3           -           -          ?
survivor area (KB)        170         -           -          disk
pause time (sec)          0.30        -           -          60
scavenge time (%)         10%         -           -          ?
primary memory (KB)       760         -           -          2,000

Proposal 3. Three generations, fast tenuring.
creation area (KB)        140         0           -          4,000
gap time (sec)            1           600         -          ?
survivor area (KB)        17          150         -          disk
pause time (sec)          0.030       0.30        -          60
scavenge time (%)         3%          0.05%       -          ?
primary memory (KB)       170         300         -          1-3 MB

Proposal 4. Three generations, slow tenuring.
creation area (KB)        140         -           0          3,000
gap time (sec)            1           -           2,000      ?
survivor area (KB)        17          -           500        disk
pause time (sec)          0.030       -           *10        60
scavenge time (%)         3%          -           0.5%       ?
primary memory (KB)       170         -           500        0.5-2.5 MB

Proposal 5. Four generations.
creation area (KB)        140         0           0          3,000
gap time (sec)            1           600         20,000?    ?
survivor area (KB)        17          150         500        disk
pause time (sec)          0.030       0.30        *10        60
scavenge time (%)         3%          0.05%       0.05%?     ?
primary memory (KB)       170         300         500        0.5-2.5 MB

Table 6.3: Quantities to analyze a single generation.

symbol   description                                            units

constants
ct       SOAR cycle time                                        seconds
se       scavenge effort: avg. cycles per scavenged byte        cycles per byte
abw      allocation bandwidth: rate of new data instantiation   bytes per second

independent variables
surv     size of each survivor area                             bytes
Eden     size of new object creation area                       bytes

dependent variables
mem      total memory used                                      bytes
pause    length of scavenging pause                             seconds
gap      gap between scavenges                                  seconds
ov       fraction of CPU used for scavenging this generation    fraction [0, 1]

predicted  from  the  amount  of  data  surviving  the  scavenge. 


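$$ \mathit{pause} = (\mathit{se} \times \mathit{ct}) \times \mathit{surv} \qquad (1) $$

Let’s test this with an example. Plugging in typical SOAR parameters ct = 400 ns,
se = 5.5 cycles/byte, and surv = 8,800 bytes:

$$ \mathit{pause} = (5.5 \times 400\,\text{ns}) \times 8{,}800 = 19\,\text{ms} \qquad (1E) $$

which matches the simulated pause time of 19 ms.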

Reducing the tenuring threshold will limit the quantity of data that survives a scavenge
by promoting the oldest surviving objects. Once in Old space, they need not be scavenged.
But, as discussed in the previous section, too much tenuring can provoke thrashing. Thus,
we recommend choosing an acceptable pause time (perhaps from 10 ms to 100 ms) and
adaptively adjusting the tenure threshold to maintain the corresponding amount of untenured
data.


The  next  step  is  to  calculate  the  amount  of  memory  devoted  to  newly-created  objects. 
Let’s  assume  that  the  rate  of  object  allocation  is  fairly  constant.  Then 


$$ \mathit{gap} = \frac{\mathit{Eden}}{\mathit{abw}} \qquad (2) $$


For example, in the growth rate experiment in the previous section, we found that the
compiler benchmark generated 17,000 words per second. Thus, abw = 68,000 bytes/sec, so for
Eden = 150,000,

$$ \mathit{gap} = \frac{150{,}000}{68{,}000} = 2.2\,\text{sec} $$

In other words, with 150 KB for new objects, SOAR could run for two seconds between
successive scavenges.


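Although $\mathit{ov} = \frac{\mathit{pause}}{\mathit{pause} + \mathit{gap}}$, we will use a simpler approximation,

$$ \mathit{ov} \approx \frac{\mathit{pause}}{\mathit{gap}} \qquad (3) $$

for our analysis. (This is a reasonable approximation because we only care about systems
with low overhead.) Continuing with our example, we can use equation (3) to calculate the
time overhead:

$$ \mathit{ov} = \frac{19\,\text{ms}}{2.2\,\text{sec}} = 0.86\% \qquad (3E) $$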
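
To see why the simpler form is acceptable at low overhead, the following sketch compares the exact ratio pause / (pause + gap) with the approximation pause / gap, using the 19 ms pause and 2.2 second gap from the example. The function names are hypothetical.

```python
def overhead_exact(pause_s: float, gap_s: float) -> float:
    """Fraction of total time spent scavenging."""
    return pause_s / (pause_s + gap_s)


def overhead_approx(pause_s: float, gap_s: float) -> float:
    """Approximation (3): pause / gap, valid when pause << gap."""
    return pause_s / gap_s


PAUSE_S = 0.019   # 19 ms pause from the example
GAP_S = 2.2       # 2.2 s gap from the example

exact = overhead_exact(PAUSE_S, GAP_S)
approx = overhead_approx(PAUSE_S, GAP_S)
print(f"exact  = {exact:.4%}")
print(f"approx = {approx:.4%}")
print(f"relative error = {(approx - exact) / exact:.2%}")
```

The relative error of the approximation equals pause / gap itself, so it stays below one percent of the overhead figure whenever the overhead is below one percent.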

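Since we have expressions for the pause and gap times, we can combine (1), (2), and
(3) to express the overhead in terms of memory allocations:

$$ \mathit{ov} = \frac{\mathit{surv}}{\mathit{Eden}} \, (\mathit{se} \times \mathit{ct} \times \mathit{abw}) \qquad (4) $$

Suppose we need to decide how much memory to allocate for Eden in SOAR:

$$ \frac{8600}{\mathit{Eden}} = \frac{\mathit{ov}}{0.15} $$

$$ \mathit{Eden} \times \mathit{ov} \approx 1300\ \text{bytes} \qquad (4E) $$

So, for 2% overhead, we would allocate 65 KB to Eden. This would total
2 × 8,600 + 65,000 ≈ 82 KB of main memory for New objects.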

For the general case we can combine

$$ \mathit{mem} = \mathit{Eden} + 2 \times \mathit{surv} \qquad (5) $$

with (4) to calculate the total memory required. Suppose we built the system as described
above, only to discover that it tenures too much garbage. The first step to cut down on
tenuring would be to boost the quantity of untenured survivors. This will increase the pause
time for a scavenge; equation (1) says that $\mathit{surv} = \frac{\mathit{pause}}{2 \times 10^{-6}}$. Thus, 50 KB of survivors will
result in pauses that last 100 ms. The increased pause time will drive up CPU overhead
unless we dedicate more memory to Eden. Suppose we allow CPU overhead to rise to 5% to
economize on memory; then equation (4) gives the size of the Eden area required.


$$ \frac{50{,}000}{\mathit{Eden}} = \frac{0.05}{0.15} = 0.33 $$

$$ \mathit{Eden} = \frac{50{,}000}{0.33} = 150{,}000 $$

Equation (5) then supplies the total memory for this generation:

$$ \mathit{mem} = 150{,}000 + 2 \times 50{,}000 = 250{,}000 \qquad (5E) $$
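
Equations (1) through (5) can be collected into a small model of a single scavenged generation. This is a sketch for checking the worked examples, not part of the original analysis; the function names are hypothetical and the constants are the SOAR parameters quoted in the text (se = 5.5 cycles/byte, ct = 400 ns, abw = 68,000 bytes/sec).

```python
SE = 5.5        # scavenge effort, cycles per scavenged byte
CT = 400e-9     # SOAR cycle time, seconds
ABW = 68_000    # allocation bandwidth, bytes per second


def pause_s(surv: float) -> float:
    """Equation (1): pause = (se * ct) * surv."""
    return SE * CT * surv


def gap_s(eden: float) -> float:
    """Equation (2): gap = Eden / abw."""
    return eden / ABW


def overhead(eden: float, surv: float) -> float:
    """Equations (3) and (4): ov = pause / gap."""
    return pause_s(surv) / gap_s(eden)


def eden_for_overhead(surv: float, ov: float) -> float:
    """Equation (4) solved for Eden."""
    return SE * CT * ABW * surv / ov


def total_memory(eden: float, surv: float) -> float:
    """Equation (5): mem = Eden + 2 * surv."""
    return eden + 2 * surv


# First example: 8,600 bytes of survivors at 2% overhead.
eden_small = eden_for_overhead(8_600, 0.02)
print(f"Eden = {eden_small:,.0f} bytes, mem = {total_memory(eden_small, 8_600):,.0f} bytes")

# Second example: 50 KB of survivors at 5% overhead.
eden_large = eden_for_overhead(50_000, 0.05)
print(f"Eden = {eden_large:,.0f} bytes, mem = {total_memory(eden_large, 50_000):,.0f} bytes")
```

The model uses the unrounded product se × ct × abw (about 0.1496), so its results differ slightly from the rounded figures in the text (65 KB and 82 KB, 150,000 and 250,000 bytes).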

6.4.  Analyzing  a  Middle  Generation 

What if this is still not enough space for medium-lifetime objects? A third generation
can be added in the middle. This results in a system with three generations: a generation for
evanescent objects (Generation 1), a generation for medium-lived objects (Generation 2),
and a generation for permanent objects (Figure 6.2). Assuming that we keep Generation 2 in
primary memory, how are we going to divide memory among the two scavenged generations?
The equations in the previous section specify the behavior of a single scavenged generation,
so we can apply them to each of the two scavenged generations, using subscripts to
indicate the generation. Then, by superposition from (4):


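$$ \mathit{ov} = \mathit{ov}_1 + \mathit{ov}_2 = \frac{(\mathit{se}_1 \times \mathit{ct}_1 \times \mathit{abw}_1)\,\mathit{surv}_1}{\mathit{Eden}_1} + \frac{(\mathit{se}_2 \times \mathit{ct}_2 \times \mathit{abw}_2)\,\mathit{surv}_2}{\mathit{Eden}_2} \qquad (6) $$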

For example, assume that each window uses 15 KB of data, and that we want to be able to
support ten windows without tenuring. Then surv_2 = 150 KB. If we open one window per
minute, abw_2 = 15 KB/min = 250 bytes/sec. (se and ct are the same for both generations.) Thus,

$$ \mathit{ov} = \mathit{ov}_1 + \mathit{ov}_2 = \frac{1300}{\mathit{Eden}_1} + \frac{74}{\mathit{Eden}_2} $$


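Continuing with our example,

$$ \frac{\mathit{Eden}_1}{\mathit{Eden}} = \frac{\sqrt{1300}}{\sqrt{1300} + \sqrt{74}} = 81\% \quad \text{and} \quad \frac{\mathit{Eden}_2}{\mathit{Eden}} = 19\% $$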


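Given an optimal split, we can plug (8) into (6) to find the minimum amount of overhead
for a given amount of memory:

$$ \mathit{ov} \times \mathit{Eden} = \left[ \sqrt{(\mathit{se}_1 \times \mathit{ct}_1 \times \mathit{abw}_1)\,\mathit{surv}_1} + \sqrt{(\mathit{se}_2 \times \mathit{ct}_2 \times \mathit{abw}_2)\,\mathit{surv}_2} \right]^2 \qquad (9) $$

For our example,

$$ \mathit{ov} \times \mathit{Eden} = \left[ \sqrt{1300} + \sqrt{74} \right]^2 \approx 2000 \qquad (9E) $$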

So, for 2% overhead, 100 KB of Eden would be needed. Adding in the survivor areas, 420
KB of physical memory would be used for scavenging. What about those long pauses for
Generation 2? From (1), pause_2 = 150,000 × se × ct = 300 ms. From (2),
gap_2 = (0.19 × 100 KB) / 250 = 76 secs. Thus, by adding a middle generation, we have made
it possible to scavenge more untenured data by increasing the gap between long scavenges.
This lets us keep 160 KB of untenured data in 420 KB of main memory at a time cost of
2.0%.


We may decide that minimizing the total CPU overhead is not as important as reducing
the frequency of long pauses. In that case, we can abandon (8) and use (1) and (2). Suppose
we can only tolerate a 300 ms pause once every 3 minutes. Then, using (2),
Eden_2 = 180 × 250 = 45 KB. Assuming we use the same amount of memory as above, that
leaves 55 KB for Eden_1. This results in a 0.81 second gap for Generation 1. With these
parameters the total overhead is 19/810 + 300/180,000 = 2.5%. Of course, this is worse than the
optimal overhead of 2.0%.
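
The two-generation trade-off can be checked numerically. The sketch below takes the example coefficients 1300 and 74 (the values of se × ct × abw × surv for Generations 1 and 2) directly from the text; the function names are hypothetical. It assumes the optimal split gives each generation a share of Eden proportional to the square root of its coefficient, which is the split implied by equation (9).

```python
import math

K1 = 1300.0   # Generation 1 coefficient from the example, bytes
K2 = 74.0     # Generation 2 coefficient from the example, bytes


def optimal_split(k1: float, k2: float) -> tuple[float, float]:
    """Fractions of total Eden given to each generation at minimum overhead."""
    s1, s2 = math.sqrt(k1), math.sqrt(k2)
    return s1 / (s1 + s2), s2 / (s1 + s2)


def overhead(k1: float, k2: float, eden1: float, eden2: float) -> float:
    """Equation (6) with the coefficients already multiplied out."""
    return k1 / eden1 + k2 / eden2


def min_overhead_times_eden(k1: float, k2: float) -> float:
    """Equation (9): the minimum of ov * Eden over all splits."""
    return (math.sqrt(k1) + math.sqrt(k2)) ** 2


f1, f2 = optimal_split(K1, K2)
total_eden = 100_000
print(f"split: {f1:.0%} / {f2:.0%}")
print(f"optimal overhead: {overhead(K1, K2, f1 * total_eden, f2 * total_eden):.2%}")
print(f"45 KB for Eden_2:  {overhead(K1, K2, 55_000, 45_000):.2%}")
```

With 100 KB of Eden the optimal split costs about 2.0%, while reserving 45 KB for the middle generation to stretch its gap to three minutes costs about 2.5%, matching the figures above.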
6.5. Controlling the Tenuring Threshold


Objects must be tenured to avoid excessive pauses caused by scavenging too much
data. The problem is to set the tenure threshold given the survivors from the past generation.
We propose that a scavenge also maintain a table giving the total amount of surviving data
for each age. Such a table could then be used to predict the amount of data that would be
promoted for any given tenure threshold. Building this table would add about 10% to the
scavenge time.
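
One way such an age table might be used is sketched below. This is an illustration under stated assumptions, not the implementation proposed in the text: the table layout (a list indexed by the number of scavenges survived), the function names, and the sample byte counts are all hypothetical. An object whose age is at least the threshold is promoted; younger survivors stay in the scavenged generation.

```python
def bytes_promoted(bytes_by_age: list[int], threshold: int) -> int:
    """Data that would be promoted: survivors with age >= threshold."""
    return sum(bytes_by_age[threshold:])


def bytes_retained(bytes_by_age: list[int], threshold: int) -> int:
    """Data that would remain untenured: survivors with age < threshold."""
    return sum(bytes_by_age[:threshold])


def choose_threshold(bytes_by_age: list[int], max_survivor_bytes: int) -> int:
    """Largest threshold whose retained data still fits the survivor budget."""
    retained = 0
    threshold = 0
    for age, nbytes in enumerate(bytes_by_age):
        if retained + nbytes > max_survivor_bytes:
            break
        retained += nbytes
        threshold = age + 1
    return threshold


# Hypothetical table: bytes of surviving data at ages 0, 1, 2, 3, 4.
age_table = [4_000, 2_500, 1_500, 1_000, 800]
budget = 8_600  # survivor bytes corresponding to the chosen pause time

threshold = choose_threshold(age_table, budget)
print(f"threshold = {threshold}")
print(f"retained  = {bytes_retained(age_table, threshold)} bytes")
print(f"promoted  = {bytes_promoted(age_table, threshold)} bytes")
```

The survivor budget would come from the acceptable pause time through equation (1); a threshold of zero promotes everything, consistent with the behavior described for the tenuring experiment.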




6.6.  The  Cost  of  an  Offline  Reorganization 

To better understand the time required by an offline reorganization, we measured one
on BS, on a diskless Sun 68010 workstation. Table 6.4 gives the results: this reorganization
software is slow; 1200 memory cycles are expended in user mode on each word. Address
space limitations of early Suns forced us to reorganize the old objects by copying them to a
file, and modifying them in the file. Thus, every time a word is read from old space, a file
read subroutine is called. Current Suns and SOAR have 16 MB of address space, more than
enough to hold a copy of the 1 MB to 2 MB of old space. Replacing file read/write software
with virtual memory hardware should result in a large speedup, and a sub-minute
reorganization seems feasible.


Table 6.4: Measurements of an offline reorganization on BS.

user time             116.7 sec
system time           46.1 sec
real time             179 sec
idle time             16 sec
CPU utilization       90.9%
reads                 464
writes                492
page faults           14
initial old size      243,036 words
final old size        231,207 words
bandwidth             480 µs/word
16-bit cycles/word    1200
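
The derived rows of Table 6.4 follow from the measured ones, as this short check shows. The 400 ns memory cycle is an assumption here; it is the value for which 480 µs per word corresponds to the 1200 cycles per word reported in the table.

```python
USER_TIME_S = 116.7
SYSTEM_TIME_S = 46.1
REAL_TIME_S = 179.0
INITIAL_OLD_WORDS = 243_036
MEMORY_CYCLE_S = 400e-9   # assumed 16-bit memory cycle time

# User-mode time spent per word of old space.
us_per_word = USER_TIME_S / INITIAL_OLD_WORDS * 1e6

# The same cost expressed in memory cycles.
cycles_per_word = USER_TIME_S / INITIAL_OLD_WORDS / MEMORY_CYCLE_S

# Fraction of elapsed time during which the CPU was busy.
cpu_utilization = (USER_TIME_S + SYSTEM_TIME_S) / REAL_TIME_S

print(f"{us_per_word:.0f} us/word")
print(f"{cycles_per_word:.0f} cycles/word")
print(f"{cpu_utilization:.1%} CPU utilization")
```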



6.7. Summary


Objects that live long enough to be promoted but die shortly thereafter can present a
problem for Generation Scavenging. To study this phenomenon, we would need data from
sessions on high-performance systems using Generation Scavenging. Since we do not have
Conclusions


7.1.  Conclusions 

We have presented and evaluated the hardware and software design of Smalltalk On A
RISC (SOAR). We undertook this effort to see how well the reduced instruction set computer
style of system design would work for a software environment heretofore supported
only by complicated virtual machines. It has worked very well indeed. A combination of
hardware and software strategies has allowed us to build a single-chip NMOS microprocessor
that will match the performance of an ECL minicomputer, despite a 5:1 cycle time handicap.
With about half of the transistors of the MC68010 microprocessor, a 400 ns SOAR
will run the Smalltalk-80 system 25 times faster than the 400 ns MC68010. With only one
fifth of the transistors of the MC68020, and with a handicap of about a factor of two in cycle
time, SOAR will outrun the MC68020. RISCs pay off for experimental programming
environments.

SOAR’s performance comes at a price; namely, memory space. A bytecoded 32-bit
Smalltalk-80 image occupies a megabyte of memory. Generation Scavenging adds 200 Kb
to this, and compiling to a simple instruction set costs another 500 Kb. With current
hardware technology, the extra 700 Kb is a small price to pay for high speed.

The most important hardware features are register windows and tagged integer instructions.
These two features nearly double SOAR’s performance by reducing the cost of subroutine
calls and type-checked integer operations. Other important hardware features
include byte insert/extract instructions, two-tone instructions, forwarding, one cycle jumps
and calls, and tagged immediate data. In the realm of software, our storage management
strategies (discussed below), direct pointers, in-line caching, and compiling to a simple
instruction set are essential. In addition to permitting fast instruction decoding, the simplicity
of the base architecture enables us to add the language-specific extensions.

On the other hand, despite our best intentions, we included several superfluous features
in SOAR, including hardware support for storage reclamation, pointers to registers, parallel
nilling, and shadow registers to aid trap handling. These are architect's traps because they
increase design time and potentially increase the cycle time without appreciably reducing the
number of cycles. These traps are baited with speedups for specific operations, and sprung
when real programs fail to perform the optimized operations.

We believe that the key to good performance is a willingness to migrate functionality
from one level of abstraction to another, viewing the system as a whole rather than as a
collection of layers. During the design process, we moved functions freely up and down the
implementation hierarchy from software to silicon to achieve good performance with
minimal hardware. For example, instead of interpretation, we have chosen to burden the
software with compiling and debugging a simple instruction set that can be executed
quickly. Also, we have replaced microcoded instructions for infrequent operations with
software trap handlers. Our system was designed with an implementation technology in
mind; this is the opposite of separating the architecture from the hardware implementation.

We have developed an algorithm for automatic storage reclamation, Generation
Scavenging, that permits SOAR to be the first full-speed Smalltalk-80 system without an
object table. We have shown that, unlike many competing algorithms, Generation Scavenging
requires no hardware support. In addition, this algorithm reduces the time spent on
storage reclamation to 3% of the CPU time. This is three times better than other
Smalltalk-80 systems with comparable performance. Finally, unlike traditional
reference-counting algorithms, Generation Scavenging can reclaim circular structures of
dead objects. Automatic storage reclamation is no longer an important source of overhead.
SOAR represents a substantial improvement in cost-performance over previous
Smalltalk-80 systems. We recommend that anyone faced with the task of building a computer
for an exploratory programming environment consider compilation to a reduced
instruction set.


7.2. Future Work


At this date SOAR has been fabricated and, running at 800 ns, has successfully completed
all of its diagnostics [Pen85b]. An unforeseen critical path to memory needed by the
fast shuffle hardware has increased its cycle time from 400 ns to 510 ns. Samples has ported
the Smalltalk-80 system to the SOAR simulator; the system starts up and displays its windows
on the screen. Our goal is to run the Smalltalk-80 system on SOAR. We will then
measure the performance of the system to find any flaws lurking in our performance data.
One of the most interesting remaining tasks is to construct a debugger for SOAR that provides
all the functionality of the current Smalltalk-80 bytecode debugger. A Smalltalk-80
system running on SOAR with complete, source-level debugging facilities would demonstrate
that the primitive level of the instruction set can be hidden from the user. Finally,
Pendleton has proposed reimplementing a stripped-down SOAR with an optimized pipeline
in a more advanced VLSI technology to yield a very fast Smalltalk-80 system.

One aspect of Generation Scavenging remains in dire need of exploration: objects with
an intermediate life span. If promoted too soon, they waste disk space and can degrade virtual
memory performance. If promoted too late, they waste the CPU time needed to repeatedly
scavenge them. Adding a third, middle generation is a possibility. Further research
will require measurements of high-performance Smalltalk-80 systems with real users to
obtain realistic actuarial data.
Appendix A


Detailed  Performance  Evaluation  of  Individual  Features 


A.1.  Introduction 

This appendix contains detailed evaluations of the effectiveness of most of the features
in SOAR and a few proposed additions to SOAR. The raw data, instruction mixes, and execution
time profiles on which these calculations are based are in Appendix B. To guide you
through this section, we have reprinted part of the table of contents in Table A.1. There are
two kinds of subroutines in SOAR: subroutines written by Xerox in Smalltalk, and subroutines
written by us in assembler for runtime support. Since these are written in two different
languages, they may have different instruction mixes. For this reason, our tables of dynamic
data have three columns: one for the routines written in Smalltalk (ST), one for the routines
written in assembler (system), and one that ignores the distinction (both). Since system code
consumes two-thirds of the time, the averages (used in the other chapters) tend to be dominated
by the behavior of the system code. If this code were optimized, the numbers for
Smalltalk code would become more important for overall performance. For static measurements,
the Smalltalk routines dwarf the assembler routines, and we usually omit the assembler
ones.

A.2. Runtime Type Checking

Runtime  type  checking  distinguishes  Smalltalk-80  systems  from  those  designed  for 
conventional  languages.  SOAR  supports  this  with  a  tag  bit  for  integers  and  tagged  integer 
arithmetic  and  comparison  instructions. 
A.2.1. How Important are the Tagged Integer Instructions?

To support tagged integers, SOAR includes tagged versions of the arithmetic and comparison
instructions. To assess their importance, we first measure their frequency of use,
then calculate the performance degradation that would be caused by replacing them by
equivalent software instructions.

A.2.1.1. Tagged Instruction Frequency

Table A.2 lists the frequency of each tagged integer instruction for several benchmarks.
Zero rows have been omitted. As Table A.2 shows, for compiled Smalltalk-80
code, one out of every 8 instructions executed exploits SOAR's integer tag-checking
hardware. Overall, the ratio is about 1 out of every 11 instructions. Interestingly, tagged
skips outnumber tagged arithmetic in compiled code.

Another way to measure frequency is to count the static number of each kind of tagged
instruction. Table A.3 shows that nearly 1 out of every 11 instructions is a tagged integer
instruction. This is slightly lower than the dynamic frequency of 1 in 8.

How  often  does  SOAR  detect  an  integer  tag  trap?  As  Table  A.4  shows,  these  traps  are 
quite  rare;  less  than  4  in  1,000  tagged  instructions  trap. 

A.2.1.2. Cost of Omitting Tagged Arithmetic Instructions

How much slower would SOAR be without integer tag checking hardware? Table A.5
shows the sequences that would be needed without it under the assumption that no compiler
optimization is performed. (The feasibility of such optimization in the absence of type
declarations has yet to be demonstrated.) Table A.6 summarizes these data with cost figures.

The  next  step  is  to  combine  this  cost  data  with  the  frequency  data.  Table  A.7  lists  the 
time  cost  of  omitting  each  type  of  tagged  instruction  from  SOAR.  The  benchmarks  would 
take  from  20%  to  32%  more  time  without  integer  tag  checking  hardware  in  SOAR. 
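
The "performance relative to full SOAR" rows of Table A.7 appear to follow from the total extra time by a simple reciprocal: if a benchmark takes a fraction x more time, it runs at 1 / (1 + x) of the original speed. The sketch below checks this against the "both" totals for test3plus4 (94.76% to 187.74%) and testClassOrganizer (21.74% to 34.95%); the function name is hypothetical.

```python
def relative_performance(extra_time_fraction: float) -> float:
    """Speed relative to full SOAR when run time grows by the given fraction."""
    return 1.0 / (1.0 + extra_time_fraction)


# (benchmark, low extra time, high extra time) from the "both" column totals.
examples = [
    ("test3plus4", 0.9476, 1.8774),
    ("testClassOrganizer", 0.2174, 0.3495),
]

for name, low, high in examples:
    best = relative_performance(low)
    worst = relative_performance(high)
    print(f"{name}: {best:.0%} to {worst:.0%} of full SOAR speed")
```

These reproduce the 51% to 35% and 82% to 74% figures listed for those two benchmarks.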


Table A.2: Frequency of tagged arithmetic instructions, Part 1.

                      ST        system    both
test3plus4
  all insts           65.14%    34.86%    100%
  add                 33.07%     0.00%    21.54%
  trapl                0.00%     6.17%     2.15%
  loadc                3.35%     0.06%     2.20%
  total               36.42%     635%     25.89%

testActivationReturn
  all insts           97.21%     2.79%    100%
  sub                  9.46%     0.00%     9.20%
  skip                 9.46%     0.00%     9.20%
  loadc                9.46%     0.00%     9.20%
  total               28.40%     0.00%    27.61%

testClassOrganizer
  all insts           41.06%    58.94%    100%
  add                  1.19%     1.19%     1.19%
  sub                  0.34%     1.73%     1.15%
  sll                  0.00%     0.59%     0.35%
  skip                 2.26%     1.31%     1.70%
  trapl                0.00%     2.49%     1.47%
  load                 0.00%     0.81%     0.81%
  loadc                7.23%     0.10%     3.03%
  total               11.03%     8.79%     9.71%

testCompiler
  all insts           33.42%    66.58%    100%
  add                  1.26%     0.89%     1.01%
  sub                  0.45%     1.17%     0.93%
  sll                  0.00%     0.29%     0.19%
  skip                 1.94%     0.87%     1.23%
  trapl                0.00%     1.56%     1.04%
  load                 0.00%     1.02%     0.68%
  loadc                7.30%     0.26%     2.60%
  total               10.92%     6.07%     7.69%



Table A.2: Frequency of tagged arithmetic instructions, Part 2.

                      ST        system    both
testDecompiler
  all insts           32.19%    67.81%    100%
  add                  1.83%     1.00%
  sub                  0.47%     1.17%     0.93%
  and                  0.09%     0.00%     0.03%
  sll                  0.00%     0.10%     0.07%
  sra                  0.00%     0.16%     0.11%
  skip                 2.52%     0.62%     1.23%
  trapl                0.00%     1.56%     1.06%
  load                 0.00%     1.12%     0.76%
  loadc                7.21%     0.28%     2.51%
  total               12.08%     6.00%     7.95%

testPrintDefinition
  all insts           38.01%    61.99%    100%
  add                  2.26%     1.37%     1.71%
  sub                  0.08%     2.69%     1.70%
  skip                 4.31%     0.02%     1.65%
  trapl                0.00%     3.68%     2.28%
  load                 0.00%     2.56%     1.59%
  loadc                7.97%     0.11%     3.10%
  total               14.65%    10.44%    12.04%

testPrintHierarchy
  all insts           26.25%    73.75%    100%
  add                  2.10%     0.26%     0.73%
  sub                  0.23%     0.84%     0.68%
  skip                 2.51%     0.05%     0.70%
  trapl                0.00%     2.17%     1.60%
  load                 0.00%     1.45%     1.07%
  loadc                7.62%     0.19%     2.14%
  total               12.46%     4.98%     6.94%

Average of macro-benchmarks
  all insts           34.19%    65.81%    100%
  add                  1.73%     0.94%     1.18%
  sub                  0.31%     1.52%     1.08%
  and                  0.02%     0.00%     0.01%
  sll                  0.00%     0.20%     0.12%
  sra                  0.00%     0.03%     0.02%
  skip                 2.71%     0.57%     1.30%
  trapl                0.00%     2.29%     1.49%
  load                 0.00%     1.39%     0.98%
  loadc                7.47%     0.19%     2.68%
  total               12.23%     7.26%     8.87%



Table A.5: Write-around for tagged instructions, Part 1.

add  A  sab _ 

%or  a,  b,  t;  (omit  for  immediate) 

%»bp  la  t.  1 «  31 
jump  error 
Stadd/Stwb  a,  b,  c 
%xor  a.  b,  t 
Stand  t,  1  «31,t 

Stabp  na  t,  0;  (are  aigna  equal?) 

jump  ok:  (do!  ia  OK) 

Stxor  a,  c,  t 
Stand  t. 1 «  31,  t 

Stakip  eq  t,  0;  (overflow?) 


aad  A  or  A  xor 

Stor 

Stakip  bn  a,  1 «  31 
jump  floor 
Standi StotfStxor 

Jll _ 

Stakip  Iru  a,  1 «  31 
jump  error 
Stall  a.  b, 

Stxor  a,  b,  t 
Stand  t. 1 «  31.  t 
Stakip  eq  t,  0; 
jump  error 

_Kl _ 

Stakip  bu  a,  1 «  31 


Stakip  for  a,  1 «  31 


Stan  a.  b 

Stakip  k  a.  1 «  30 
Storb,  1  «  30,  b 


(overflow?) 


a,b,t;(niooly) 


■ 

I 


(overflow?) 




Table A.7: Time cost of omitting tagged integer instructions, Part 1.

                 ST                 system             both
test3plus4
  all cycles     59.57%             40.43%             100%
  add            150.06%-300.12%    0.00%              89.40%-178.80%
  trapl          0.00%              13.26%-22.11%      5.36%-8.94%
  loadc          6.06%              0.10%              3.65%-3.65%
  total          150.06%-300.12%    13.36%-22.21%      94.76%-187.74%
  Performance relative to full SOAR (<100% is slower)  51%-35%

testActivationReturn
  all cycles     95.91%             4.09%              100%
  sub            35.30%-70.65%      0.00%              33.87%-67.75%
  skip           21.19%-35.31%      0.00%              20.32%-33.87%
  loadc          14.13%             0.00%              13.55%
  total          70.62%-120.08%     0.00%              67.74%-115.17%
  Performance relative to full SOAR (<100% is slower)  60%-46%

testClassOrganizer
  all cycles     42.56%             57.44%             100%
  add            3.99%-7.98%        4.27%-8.54%        4.15%-8.30%
  sub            1.13%-2.26%        6.19%-12.38%       4.04%-8.08%
  sll            0.00%              2.59%              1.49%
  skip           4.61%-7.68%        2.80%-4.67%        3.57%-5.95%
  trapl          0.00%              5.40%-8.98%        3.10%-5.16%
  load           0.00%              1.98%-2.98%        1.14%-1.71%
  loadc          9.80%              0.14%              4.25%-4.25%
  total          19.54%-27.72%      23.38%-40.20%      21.74%-34.95%
  Performance relative to full SOAR (<100% is slower)  82%-74%

testCompiler
  all cycles     34.07%             65.93%             100%
  add            4.18%-8.35%        3.05%-6.11%        3.44%-6.87%
  sub            1.52%-3.05%        4.06%-8.12%        3.20%-6.39%
  and            0.03%-0.03%        0.00%-0.00%        0.01%-0.01%
  sll            0.00%              1.17%              0.77%
  sra                               0.01%
  skip           3.90%-6.49%        1.82%-3.02%        2.52%-4.20%
  trapl          0.00%              3.22%-5.37%        2.12%-3.54%
  load           0.00%              1.41%-2.12%        0.93%-1.40%
  loadc          9.77%              0.35%              3.56%-3.56%
  total          19.35%-27.65%      15.10%-26.28%      16.55%-26.74%
  Performance relative to full SOAR (<100% is slower)  86%-79%



Table A.7: Time cost of omitting tagged integer instructions, Part 2.

                      ST                 system             both
testDecompiler
  all cycles          32.38%             67.62%             100%
  add                 6.29%-12.58%       3.42%-6.85%        4.35%-8.70%
  sub                 1.55%-3.09%        4.00%-8.00%        3.20%-6.41%
  and                 0.09%-0.15%        0.00%              0.03%-0.05%
  sll                 0.00%              0.40%              0.27%
  sra                 0.00%              0.43%              0.29%
  skip                5.13%-8.52%        1.29%-2.13%        2.53%-4.21%
  trapl               0.00%              3.22%-5.37%        2.18%-3.63%
  load                0.00%              1.54%-2.29%        1.04%-1.55%
  loadc               9.82%              0.40%              3.44%-3.44%
  total               22.86%-34.16%      14.68%-25.88%      17.34%-28.56%
  Performance relative to full SOAR (<100% is slower)       85%-78%

testPrintDefinition
  all cycles          38.09%             61.91%             100%
  add                 8.30%-16.61%       5.01%-10.02%       6.26%-12.53%
  sub                 0.25%-0.50%        9.89%-19.78%       6.22%-12.44%
  skip                9.45%-15.78%       0.03%-0.05%        3.62%-6.04%
  trapl               0.00%              8.09%-13.49%       5.01%-8.35%
  load                0.00%              3.78%-5.65%        2.34%-3.50%
  loadc               11.66%             0.16%              4.55%-4.55%
  total               29.69%-44.55%      26.95%-49.16%      27.99%-47.40%
  Performance relative to full SOAR (<100% is slower)       78%-68%

testPrintHierarchy
  all cycles          25.90%             74.10%             100%
  add                 7.42%-14.85%       0.89%-1.78%        2.58%-5.16%
  sub                 0.82%-1.65%        2.95%-5.89%        2.40%-4.79%
  and                 0.04%              0.00%              0.01%
  sll                 0.00%              0.03%              0.02%
  skip                5.37%-8.96%        0.12%-0.20%        1.48%-2.47%
  trapl               0.00%              4.56%-7.60%        3.38%-5.63%
  load                0.00%              2.04%-3.06%        1.51%-2.27%
  loadc               10.89%             0.27%              3.02%-3.02%
  total               24.52%-36.34%      10.84%-18.81%      14.38%-23.36%
  Performance relative to full SOAR (<100% is slower)       87%-81%

Table A.7: Time cost of omitting tagged integer instructions, Part 3.

                      ST                 system             both
average of macro-benchmarks
  all cycles          34.60%             65.40%             100%
  add                 6.04%-12.07%       3.33%-6.65%        4.15%-8.31%
  sub                 1.05%-2.11%        5.42%-10.84%       3.81%-7.62%
  and                 0.03%-0.04%        0.00%              0.01%-0.02%
  sll                 0.00%              0.84%              0.51%
  sra                 0.00%              0.09%              0.06%
  skip                5.69%-9.49%        1.21%-2.01%        2.74%-4.57%
  trapl               0%                 4.90%-8.16%        3.16%-5.26%
  load                0.00%              2.15%-3.22%        1.39%-2.09%
  loadc               10.39%             0.26%              3.76%
  total               23.19%-34.09%      18.19%-32.08%      19.61%-32.31%
  Performance relative to full SOAR (<100% is slower)       84%-76%


Of course, eliminating tag checking hardware from SOAR would also incur a space
cost for the extra checking instructions. Table A.8 combines the static cost data with the
static frequency data to compute the code expansion resulting from omitting data tag
checking hardware in SOAR. Again, we can ignore the system code because it is so small. The
data show that 38% more instructions would be needed, that is, about 15% of the total image.


Table A.8: Static Cost of Omitting Tagged Arith Insts in System.

(3502  instruction  words) 

(493  data  words) 

(3995  total  words  in  sys) 

(168381  SOAR  words  of  compiled  code  &  literals) 

(4,600  Smalltalk  subroutines) 

(430,000  SOAR  words  total  image)  _ 


immediate?  cost  %code  %code  +  data 


add  yes  7462 

add  no  1  11320  i 


0.07% 

0.23% 

0.00% 

0.04% 


0.03% 

0.09% 

0.00% 

0.02% 

By moving the tag check into hardware we have increased the cost for a tag exception.
SOAR must take a trap to handle one. The data show that only 0.39% of tagged instructions
trap, and that only 12.5% of the instructions are tagged. Thus, a tag trap occurs once for
every 2000 instructions. Since the tag trap handler prologue is about 25 instructions long,
this represents a time cost of about 1.25%.
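
The arithmetic behind these figures can be checked with a short Python sketch. The function and constant names are illustrative only, and the sketch assumes that each prologue instruction costs as much as an average instruction; the text rounds the results to 2000 instructions and 1.25%.

```python
# Figures quoted in the paragraph above.
TRAP_RATE_PER_TAGGED = 0.0039  # 0.39% of tagged instructions trap
TAGGED_FRACTION = 0.125        # 12.5% of all instructions are tagged
HANDLER_PROLOGUE = 25          # instructions in the tag trap handler prologue


def tag_trap_overhead(trap_rate, tagged_fraction, prologue_len):
    """Return (instructions between tag traps, fractional time cost)."""
    traps_per_instruction = trap_rate * tagged_fraction
    interval = 1.0 / traps_per_instruction
    # Assumption: a prologue instruction costs the same as an average one.
    cost = prologue_len * traps_per_instruction
    return interval, cost


interval, cost = tag_trap_overhead(
    TRAP_RATE_PER_TAGGED, TAGGED_FRACTION, HANDLER_PROLOGUE
)
print(f"one tag trap every {interval:.0f} instructions")
print(f"time cost of the prologue: {cost:.2%}")
```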

To summarize, SOAR without hardware support for integer tag checking and with the
same code generation strategy would run 24% slower and require about 150 KB more
memory.

A.2.2. Evaluating the Impact of Adding a Compare-and-Branch Instruction

Instead of condition codes, SOAR uses conditional skip instructions. This simplifies
handling comparisons of data that are not integers. The tag trap handler need not set
condition codes, but can merely return to the appropriate location. As a result, a conditional jump
in SOAR takes two cycles: one for the skip instruction and another for the jump. This is as
fast as it can be without an additional adder to compute jump addresses. If we had such a
device, how much faster could SOAR run? To bound the number of times a conditional
jump instruction would be used we can count skips. We can find a more accurate figure by
counting only those skips that skip over unconditional jumps. Table A.9 presents these data.
The table shows that the most that could be hoped for is an 8% improvement. Counting only
those skips that follow jumps results in a time savings of 2.6%. The large disparity implies
that there are many places where the conditionally executed code is only a single instruction.

For a static analysis, we counted the number of conditional jump sequences produced
by the compiler (Table A.10). The table shows that little space would be saved.
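
As a rough illustration of the dynamic bound, the Python sketch below averages the "both" column of Table A.9, Part 1. It assumes that a compare-and-branch instruction saves exactly one cycle per fused skip and jump, so the saving equals the per-cycle frequency. Only three of the five macro-benchmarks are included, so the tighter estimate comes out near, but not equal to, the 2.6% quoted above; the names are hypothetical.

```python
# (total skips per cycle, skip-jumps per cycle) in percent, taken from the
# "both" column of Table A.9, Part 1.
SKIPS_PER_CYCLE = {
    "testClassOrganizer": (6.76, 2.95),
    "testCompiler": (8.26, 2.94),
    "testDecompiler": (9.16, 2.44),
}


def savings_bounds(table):
    """Average the per-cycle frequencies over the benchmarks.

    Assumption: one cycle is saved per fused skip+jump, so the percentage
    of cycles saved equals the per-cycle frequency of the fused pairs.
    """
    n = len(table)
    upper = sum(total for total, _ in table.values()) / n
    tighter = sum(skip_jump for _, skip_jump in table.values()) / n
    return upper, tighter


upper, tighter = savings_bounds(SKIPS_PER_CYCLE)
print(f"upper bound (every skip fused): {upper:.2f}% of cycles")
print(f"skips over jumps only:          {tighter:.2f}% of cycles")
```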

Table A.9: Upper bound on speedup with compare-and-branch, Part 1.

                                      ST         system     both
testClassOrganizer
  instructions                        41.06%     58.94%     100%
  cycles                              42.56%     57.44%     100%
  untagged skips per instruction      1.57%      12.39%     7.95%
  tagged skips per instruction        2.27%      1.30%      1.70%
  total skips per instruction         3.84%      13.69%     9.65%
  skip-jumps per instruction          1.06%      5.49%      3.67%
  untagged skips per cycle            1.06%      8.91%      5.57%
  tagged skips per cycle              1.53%      0.93%      1.19%
  total skips per cycle               2.60%      9.84%      6.76%
  skip-jumps per cycle                0.85%      4.43%      2.95%

testCompiler
  instructions                        33.42%     66.58%     100%
  cycles                              34.07%     65.93%     100%
  untagged skips per instruction      1.50%      15.57%     10.87%
  tagged skips per instruction        1.93%      0.88%      1.23%
  total skips per instruction         3.44%      16.44%     12.10%
  skip-jumps per instruction          1.37%      5.78%      4.30%
  untagged skips per cycle            1.01%      10.74%     7.42%
  tagged skips per cycle              1.30%      0.60%      0.84%
  total skips per cycle               2.30%      11.34%     8.26%
  skip-jumps per cycle                0.92%      3.98%      2.94%

testDecompiler
  instructions                        32.19%     67.81%     100%
  cycles                              32.38%     67.62%     100%
  untagged skips per instruction      0.72%      17.56%     12.14%
  tagged skips per instruction        2.51%      0.62%      1.23%
  total skips per instruction         3.23%      18.18%     13.37%
  skip-jumps per instruction          1.29%      4.63%      3.56%
  untagged skips per cycle            0.49%      12.07%     8.32%
  tagged skips per cycle              1.71%      0.43%      0.84%
  total skips per cycle               2.20%      12.50%     9.16%
  skip-jumps per cycle                0.88%      3.18%      2.44%


untagged  skip  s  per  cycle 
tagged  skip’s  per  cycle 
total  skip's  per  cycle 


2.79% 


test  Print  Hierarchy 


26 

23 


j  nntagged  skip’s  per  cycle 

0.86% 

1 

:  tagged  skip's  per  cycle 

1.79% 

|  total  skip’s  per  cycle 

2.65% 

I 

skip- jumps  per  cycle 

1.19% 

average  of  macro-benchmarks 


% 

% 


5.79% 

2.13% 


0.49% 


100.00% 

100.00% 


nntagged  skip  s  per  cycle 
tagged  skip’s  per  cycle 
total  skip's  per  cycle 

A.2.3. Evaluating Two-Tone Instructions

SOAR has two modes of execution: tagged and untagged. Rather than putting a mode
bit in the PSW and spending a cycle to switch modes when needed, we put a mode bit in
each instruction. Table A.11 shows how much slower SOAR would run if it took extra time
to switch modes. The table shows that SOAR would be 16% slower without two-tone
instructions.

To compute the code expansion, we instrumented the compiler. Table A.12 analyzes
these data. The table shows that the image would be 19% larger without two-tone
instructions.


Table A.11: Projected time cost of manipulating PSW mode bit.

                                        ST         system     both
testClassOrganizer
  cycles                                42.56%     57.44%     100%
  cost of mode-setting instructions     17.86%     19.30%     18.69%
testCompiler
  cycles                                34.07%     65.93%     100%
  cost of mode-setting instructions     18.52%     12.68%     14.67%
testDecompiler
  cycles                                32.38%     67.62%     100%
  cost of mode-setting instructions     19.87%     11.92%     14.50%
testPrintDefinition
  cycles                                38.09%     61.91%     100%
  cost of mode-setting instructions     20.53%     20.35%     20.42%
testPrintHierarchy
  cycles                                25.90%     74.10%     100%
  cost of mode-setting instructions     21.74%     9.93%      12.99%
average of macro-benchmarks
  cycles                                34.60%     65.40%     100.00%
  cost of mode-setting instructions     19.70%     14.84%     16.25%

Table A.12: Space cost of mode bit in PSW.

  number of extra instructions to change PSW mode bit     70759
  image size                                              1,500 kB
  relative cost of PSW mode bit                           18.87%
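
The relative cost in Table A.12 can be reproduced with the following Python sketch. It rests on two assumptions not stated in the table: each extra instruction occupies one 4-byte (32-bit) word, and 1 kB is taken as 1000 bytes. The names are illustrative.

```python
EXTRA_INSTRUCTIONS = 70759      # from Table A.12
IMAGE_SIZE_BYTES = 1500 * 1000  # 1,500 kB, assuming 1 kB = 1000 bytes
BYTES_PER_WORD = 4              # assumption: one 32-bit word per instruction


def relative_space_cost(extra_words, image_bytes, bytes_per_word=BYTES_PER_WORD):
    """Fraction by which the image grows when extra_words are added."""
    return extra_words * bytes_per_word / image_bytes


psw_cost = relative_space_cost(EXTRA_INSTRUCTIONS, IMAGE_SIZE_BYTES)
print(f"relative cost of PSW mode bit: {psw_cost:.2%}")
```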

A.2.4. How Important Are Tagged Immediates?

SOAR's tagged immediate format crams tagged values such as nil, true, and false into
a twelve-bit immediate field. Without this feature, a two-cycle load instruction would be
needed to get a tagged value. Table A.13 analyzes the performance impact of this feature.
For each benchmark, it gives the breakdown of cycles spent in Smalltalk vs. system code,
then proceeds to give the percentage of immediates used requiring the tagged format, and
finally, the time cost of omitting this feature. These data suggest that SOAR would be 10%
slower without this feature.

To analyze the impact of tagged immediates on the size of the compiled image, we
instrumented our compiler (Table A.14). As expected, non-negative integers dominate
immediate values. Pointer immediates are also frequent. Interestingly, boolean masks (all
zeroes with a one in one of the top four bits, or tag values) provide a use for tagged
immediates more often than pointers.

The next step is to count the number of immediates that would be unrepresentable
without tagged immediates and determine the amount of further expansion in the image
(Table A.15). Tagged immediates don't save much space; the image would only be 1.2%
larger without them.

A.3. Interpretation

This section concerns features of SOAR's instruction set and trap system.

A.3.1. Evaluating SOAR's Byte Facilities

We perform two comparisons: the speedup possible with load/store byte instructions,
and the slowdown had we not provided the insert and extract instructions. Table A.16 gives
the important instruction sequences: the loadByte and storeByte instructions are slightly faster
than extract and insert, which in turn are much faster than relying on one-bit shifts.


tagged  imms/all  imms 
ed  imm  cost/all  cycles 


•jTT 


3238%  < 

tagged  imm«/all  imm* 

12.74%  1 

tagged  imm  cost/all  cycles 

6.12%  1 

testPrintD 


10.78%  ! 


cycles 

38.09% 

61.91% 

100% 

tagged  imms/all  imms 

12.63% 

10.29% 

10.88% 

tagged  imm  cost/all  cycles 

5.90% 

8.75% 

7.66% 

test  Print  Hierarchy 


tagged  imms/all  imms 
ed  imm  cost/all  cycles 


average  of  macro-benchmarks 


65.40% 


100.00% 


tagged  imm&'all  imms 
ed  imm  cost/all  cycles 




Table A.14: Raw data for static analysis of tagged immediates.

  immediate value             count     OK in SOAR    OK w/o tagged immediates
  non-negative integers       35106     yes           yes
  negative 31-bit integers    7968      yes           yes*
  boolean masks               2984      yes           no
  pointers                    2433      yes           no
  invalid† pointers           8507      no            no
  invalid† integers           868       no            yes*

  total SOAR image size       1500 kB

Table A.15: Impact of eliminating tagged immediates.

  cost for pointers       5417 immediates
  savings for integers    868 immediates
  net cost                4549 immediates
  relative cost           1.21%
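
The entries of Table A.15 follow from the counts in Table A.14, as the Python sketch below suggests. It assumes that boolean masks and pointers are the immediates lost without the tagged format, that each lost immediate costs one extra 4-byte word, and that 1 kB is 1000 bytes; these are inferences from the tables, not statements from the text.

```python
# Counts from Table A.14.
BOOLEAN_MASKS = 2984
POINTERS = 2433
INVALID_INTEGERS = 868  # would fit if the tag bits were given back to integers

IMAGE_SIZE_BYTES = 1500 * 1000  # 1500 kB, assuming 1 kB = 1000 bytes
BYTES_PER_WORD = 4              # assumption: one extra 32-bit word per immediate

cost_for_pointers = BOOLEAN_MASKS + POINTERS
net_cost = cost_for_pointers - INVALID_INTEGERS
relative_cost = net_cost * BYTES_PER_WORD / IMAGE_SIZE_BYTES

print(f"cost for pointers: {cost_for_pointers} immediates")
print(f"net cost:          {net_cost} immediates")
print(f"relative cost:     {relative_cost:.2%}")
```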

Table A.16: Code sequences for byte operations, Part 1.
(Byte 0 is least significant byte, byte 3 is most significant.)

Loading a byte from memory

load byte instruction (addition to SOAR)
    loadByte   (base)offset + byteNo, dest
    time       2 cycles

extract byte instruction (current SOAR)
    load       (base)offset, dest
    extract    dest, byteNo, dest
    time       3 cycles

no special instructions (simplification to SOAR)
    load       (base)offset, dest
    srl        dest, dest              (0 to 24 of these)
    load       pcRel(mask), maskReg    (omit for byte 3)
    and        dest, maskReg, dest     (omit for byte 3)
  mask: 0xff
    byte 0 time    5 cycles
    byte 1 time    13 cycles
    byte 2 time    21 cycles
    byte 3 time    26 cycles
    avg. time      16 cycles

* In order to be conservative, we assume that the negative immediates could be represented without tagged immediates
by either changing the opcode to subtract instead of add or, for offsets, by using the full 32-bit representation. We further
assume that the integers which are too big for our current scheme would fit in four more bits.

† These values do not fit in SOAR's tagged immediate format.

Table A.16: Code sequences for byte operations, Part 2.
(Byte 0 is least significant byte, byte 3 is most significant.)

Storing a byte in memory

store byte instruction (addition to SOAR)
    storeByte  source, (base)offset + byteNo
    time       2 cycles

insert byte instruction (current SOAR)
    load       (base)offset, dest
    load       (base)offset, r1
    load       pcRel(mask), maskReg
    and        r1, maskReg, r1
    insert     source, byteNo, r2
    or         r1, r2, r1
    store      r1, (base)offset
    time       9 cycles

no special instructions (simplification of SOAR)
    load       (base)offset, r1
    load       pcRel(mask), maskReg
    and        r1, maskReg, r1
    sll        source, source
    xor        maskReg, -1, maskReg        (omit for byte 3)
    and        source, maskReg, source     (omit for byte 3)
    or         r1, source, r1
    store      r1, (base)offset
    byte 0 time    10 cycles
    byte 1 time    18 cycles
    byte 2 time    26 cycles
    byte 3 time    32 cycles
    avg. time      22 cycles
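
The per-byte timings for the "no special instructions" sequences in Table A.16 can be recovered from a simple cycle model, sketched here in Python. The model assumes two cycles per load or store and one cycle per ALU operation or one-bit shift, with eight shifts per byte position. The two-cycle store is an inference from the table's totals rather than something the table states, and the function names are illustrative.

```python
LOAD = STORE = 2  # cycles; the store cost is inferred from the table's totals
ALU = 1           # cycles for and/or/xor and for each one-bit shift


def load_byte_cycles(byte_no):
    """Cycles to fetch byte `byte_no` using only load, srl, and and."""
    cycles = LOAD + 8 * byte_no * ALU  # load the word, then one srl per bit
    if byte_no != 3:                   # byte 3 needs no masking after the shift
        cycles += LOAD + ALU           # load the mask, then and
    return cycles


def store_byte_cycles(byte_no):
    """Cycles to store into byte `byte_no` using only load, sll, logic ops, store."""
    cycles = LOAD + LOAD + ALU         # load word, load mask, and
    cycles += 8 * byte_no * ALU        # one sll per bit
    if byte_no != 3:
        cycles += ALU + ALU            # xor and and, both omitted for byte 3
    cycles += ALU + STORE              # or, then store the word back
    return cycles


load_times = [load_byte_cycles(b) for b in range(4)]
store_times = [store_byte_cycles(b) for b in range(4)]
print("load: ", load_times, "avg", sum(load_times) / 4)
print("store:", store_times, "avg", sum(store_times) / 4)
```

The averages come out to 16.25 and 21.5 cycles, which the table rounds to 16 and 22.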

Next, in Table A.17, we gather frequency data on insert and extract instructions, and
multiply by the various costs to evaluate the performance impact of these other two schemes.
As shown in the last section of Table A.17, the average time savings for adding load/store
byte instructions would be 7%, while the average time penalty for taking away the byte
insert/extract instructions would be 33%. Byte insert/extract instructions seem to be a good
compromise between functionality and efficiency.
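
The multiplication described above amounts to weighting each operation's change in cycle count by its per-cycle frequency. The Python sketch below applies this to the testDecompiler "both" column of Table A.17, using the sequence lengths from Table A.16; it lands within rounding of the table's 4.92% savings and 23.50% average penalty. The dictionary and function names are hypothetical.

```python
# Sequence lengths in cycles, from Table A.16.
CURRENT = {"insert": 9, "extract": 3}        # insert/extract (current SOAR)
WITH_BYTE_OPS = {"insert": 2, "extract": 2}  # storeByte/loadByte
WITHOUT = {"insert": 22, "extract": 16}      # average with one-bit shifts only

# Per-cycle frequencies in percent: testDecompiler, "both" column, Table A.17.
FREQ = {"insert": 0.52, "extract": 1.29}


def time_change(freq, old, new):
    """Percent change in cycles when each operation goes from old to new cost."""
    return sum(freq[op] * (new[op] - old[op]) for op in freq)


savings = -time_change(FREQ, CURRENT, WITH_BYTE_OPS)
penalty = time_change(FREQ, CURRENT, WITHOUT)
print(f"load & store byte savings:       {savings:.2f}%")
print(f"avg insert/extract omission cost: {penalty:.2f}%")
```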


Table  A.  17:  Dynamic  analysis  of  byte  operations,  Part  1. 


testC  lassOrganizer 


n  41.06%  58.94%  100% 


steps 
,  cycles 


I  insert  per  inst 
I  extract  per  inst 
!  insert  +  extract 


insert  per  cycle 
|  extract  per  cycle 
'  insert  -*■  extract  per  cycle 


'  stoic  byte  savings 
!  load  byte  savings 
:  load  &  store  byte  savings 


i  min  insert  omission  cost 
:  min  extract  omission  cost 
;  min  insert/extract  omission  cost 

-  avg  insert  omission  cost 
avg  extract  omission  cost 

■  avg  insert/extract  omission  cost 

i  max  insert  omission  cost 
‘  max  extract  omission  cost 

■  max  insert/extract  omission  cost 


42.56% 


58.94% 

57.44% 


0.97% 

0.57% 

3.54% 

2.09% 

4.51% 

2.66% 

0.70% 

0.40% 

234% 

1.46% 

3.24% 

1.86% 

testCompiler 


33.42% 

34.07% 


66.58% 

65.93% 


store  byte  savings 
load  byte  savings 
load  A  store  byte  savings 


min  insert  omission  cost 
min  extract  omission  cost 
min  insen/extract  omission  cost 

avg  insen  omission  cost 
avg  extract  omission  cost 
avg  insert/extract  omission  cost 

max  insen  omission  cost 
max  extract  omission  cost 
max  insert/extract  omission  cost 


3.61% 

1.81% 

5.41% 


9.04% 

5.19% 

33.07% 

18.99% 

42.11% 

24.19% 

16.00% 

9.19% 

5850% 

33.60% 

7450% 

42.79% 

2.38% 

1.19% 

3.57% 


6.70% 

4.41% 

2351% 

15.50% 

30.20% 

19.91% 

11.85% 

7.81% 

4159% 

27.42% 

53.43% 

35.23% 

Table A.17: Dynamic analysis of byte operations, Part 2.

                                      ST         system     both
testDecompiler
  steps                               32.19%     67.81%     100%
  cycles                              32.38%     67.62%     100%
  insert per inst                     0          1.12%      0.76%
  extract per inst                    0          2.77%      1.88%
  insert + extract per inst           0          3.89%      2.64%
  insert per cycle                    0          0.77%      0.52%
  extract per cycle                   0          1.91%      1.29%
  insert + extract per cycle          0          2.67%      1.81%
  store byte savings                  0          5.37%      3.63%
  load byte savings                   0          1.91%      1.29%
  load & store byte savings           0          7.28%      4.92%
  min insert omission cost            0          0.77%      0.52%
  min extract omission cost           0          3.81%      2.58%
  min insert/extract omission cost    0          4.58%      3.10%
  avg insert omission cost            0          9.97%      6.74%
  avg extract omission cost           0          24.78%     16.76%
  avg insert/extract omission cost    0          34.75%     23.50%
  max insert omission cost            0          17.65%     11.93%
  max extract omission cost           0          43.84%     29.65%
  max insert/extract omission cost    0          61.49%     41.58%

testPrintDefinition
  steps                               38.01%     61.99%     100%
  cycles                              38.09%     61.91%     100%
  insert per inst                     0          2.23%      1.38%
  extract per inst                    0          6.03%      3.74%
  insert + extract per inst           0          8.26%      5.12%
  insert per cycle                    0          1.63%      1.01%
  extract per cycle                   0          4.42%      2.74%
  insert + extract per cycle          0          6.06%      3.75%
  store byte savings                  0          11.44%     7.08%
  load byte savings                   0          4.42%      2.74%
  load & store byte savings           0          15.86%     9.82%
  min insert omission cost            0          1.63%      1.01%
  min extract omission cost           0          8.85%      5.48%
  min insert/extract omission cost    0          10.48%     6.49%
  avg insert omission cost            0          21.24%     13.15%
  avg extract omission cost           0          57.51%     35.60%
  avg insert/extract omission cost    0          78.75%     48.75%
  max insert omission cost            0          37.57%     23.26%
  max extract omission cost           0          101.75%    62.99%
  max insert/extract omission cost    0          139.32%    86.25%

Table A.17: Dynamic analysis of byte operations, Part 3.


test  Print H  ierarchx 


s  26.25%  73.75%  1009 


steps 

cycles 


insert  per  inst 
ex  net  per  inst 
insert  +  extract  per  inst 


insert  per  cycle 
extract  per  cycle 
insert  +  extract  per  cycle 


25.90% 


73.75% 

74.10% 


2.84% 

2.09% 

4.20% 

3.10% 

7.04% 

5.19% 

1.99% 

2.95% 

4.94% 


mm  insert  omission  cost 
min  extract  omission  cost 
min  insert/extract  omission  cost 

avg  insert  omission  cost 
avg  extract  omission  cost 
avg  insert/extract  omission  cost 

max  insert  omission  cost 
max  extract  omission  cost 
max  insert/extract  omission  cost 


45.77% 

67.76% 

11334% 


10.32% 


% 


28.38% 

4735% 

33.92% 

50.21% 

84.13% 


average  of  macro-benchmarks 


34. 


insert  per  inst 
extract  per  inst 
insert  -t-  extract 


insert  per  cycle 
extract  per  cycle 
insert  +  extract  per  cycle 


store  byte  savings 
load  byte  savings 
load  &  store  byte  savings 


mm  insert  omission  cost 
min  extract  omission  cost 
min  insert/extract  omission  cost 

avg  insert  omission  cost 
avg  extract  omission  cost 
avg  insert/extract  omission  cost 

max  extract  omission  cost 
max  insen  omission  cost 
max  insert/extract  omission  cost 


100.00% 

100.00% 


1.06%  i 
231%  i 


1.77% 

7.02% 


50.00% 

32.78% 

62.69% 

40.77% 

25.77% 

17.22% 

88.46% 

58.00% 

Table A.18: Loadc Time Analysis, Part 1.
(All numbers are in percents.)

  benchmark                   Smalltalk    system     both
testActivationReturn
  steps                       97.21%       2.79%      100%
  cycles                      95.91%       4.09%      100%
  loadc per inst              9.47%        0.01%      9.20%
  loadc per cycle             7.06%        0.01%      6.77%
  loadc traps per loadc       0%           0%         0%
  cost of omitting loadc      0%           0%         0%
testClassOrganizer
  steps                       41.06%       58.94%     100%
  cycles                      42.56%       57.44%     100%
  loadc per inst              7.24%        0.10%      3.03%
  loadc per cycle             4.90%        0.07%      2.13%
  loadc traps per loadc       25.39%       0%         24.90%
  cost of omitting loadc      2.49%        0%         1.06%

testCompiler 

steps 

33.42% 

66.58% 

100% 

cycles 

34.07% 

65.93% 

100% 

loadc  per  inst 
loadc  per  cycle 
loadc  traps  per  loadc 
cost  of  omitdne  loadc 


testDecompiler 


steps 

cycles 


67.81% 

67.62% 


loadc  per  inst 

7.20% 

0.29% 

loadc  per  cycle 

4.91% 

0.20% 

loadc  traps  per  loadc 

17.06% 

0.16%  l: 

cost  of  omitting  loadc 

1.67% 

0.00%  1 

testPrintDefinition 


loadc  per  inst 
loadc  per  cycle 
loadc  traps  per  loadc 
cost  of  omitdne  loadc 


-  61.99% 

100% 

p  61.91% 

100% 

testPrintHierarchy 

steps 

26.25%  73.75% 

100% 

cycles 

25.90%  74.10% 

100% 

loadc  per  inst 
loadc  per  cycle 
loadc  traps  per  loadc 
cost  of  omitting  loadc 


100% 

100% 


0.03% 

0.02% 


testDecompiler 


!  instructions  32.19%  67.81%  100% 

I  cycles  32.38%  67.62%  100% 


I  inl/outl  uses  per  inst  0%  0.04%  0.03% 

cost  of  omittine  inl/outl%  0%  0.03%  0.02% 


• _  testPrintDefinirion 


instructions 

1  cycles _ , 

inl/outl  uses  per  inst  0%  0%  0% 

cost  of  omittine  inl/outl  %  0%  0%  0% 


_ testPrintHierarchy 


instructions  26.23%  73.73%  100% 

.  cycles  25.90%  74.10%  100% 


:  inl/outl  uses  per  inst  0%  0.00%  0.00% 

cost  of  omitting  inl/outl  %  0%  0.00%  0.00% 


61.99% 

100% 

»  61.91% 

100% 

A.3.6. Evaluating SOAR's Conditional Trap Instruction

Conditional trap instructions can save one cycle for a comparison whose outcome can
be predicted. Our SOAR software exploits the trap instruction to verify the in-line
procedure call cache, to check the tags of return values, and to test the types of arguments to
primitive routines. Table A.22 shows the sequence that would be required without this
instruction. Table A.23 shows the trap instruction dynamic frequency, and the time cost for
omitting this feature from SOAR. Since the overhead is one cycle per trap instruction, the
difference between the two numbers arises because the average instruction duration is 1.5
cycles. The data show that SOAR would be 4% slower without this feature.
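
The relation between the two rows of Table A.23 can be written as a one-line conversion, sketched below in Python with an illustrative function name. It assumes one extra cycle per trap instruction and converts a per-instruction frequency to a per-cycle cost by dividing by the average cycles per instruction. With the quoted 1.5 cycles the averaged frequency of 5.59% gives roughly 3.7%; the table's 3.93% corresponds to a slightly shorter average instruction.

```python
def cost_without_trap(traps_per_instruction_pct, cycles_per_instruction):
    """Percent extra cycles if every trap instruction needs one more cycle."""
    return traps_per_instruction_pct / cycles_per_instruction


# "both" column, average of macro-benchmarks, Table A.23.
TRAP_FREQUENCY = 5.59  # trap instructions per instruction, in percent
TABLE_COST = 3.93      # cost w/o trap instruction, in percent of cycles

estimated = cost_without_trap(TRAP_FREQUENCY, 1.5)
implied_cpi = TRAP_FREQUENCY / TABLE_COST
print(f"estimated cost at 1.5 cycles/instruction: {estimated:.2f}%")
print(f"cycles/instruction implied by the table:  {implied_cpi:.2f}")
```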

To analyze the impact of eliminating trap instructions on the size of the compiled
image, we instrumented our compiler to count trap instructions. Then, assuming that each
such instruction would become two instructions (a skip followed by a call), we can
calculate the total impact (Table A.24). Trap instructions improve image size even less than
execution speed, and our image would only be 2% larger without them.

A.3.7. One-Cycle Traps

At one point in the design of SOAR, we decided to extend the trap operation rather
than lengthen the cycle time [Pen85b]. This resulted in two-cycle traps instead of one-cycle
traps. How many cycles did this decision cost us? Table A.25 presents our data. The result
of adding the extra cycle to the trap operation was to require fewer than one percent more
cycles. This was a good decision.

Table A.22: Writearound for trap instruction.
skip 


Table A.23: Time cost of omitting the trap instruction.
(All numbers are percentages.)

                                      ST         system     both
testActivationReturn
  instructions                        97.21%     2.79%      100%
  cycles                              95.91%     4.09%      100%
  trap instructions per instruction   14.20%     0.02%      13.80%
  cost w/o trap instruction           10.59%     0.01%      10.16%
testClassOrganizer
  instructions                        41.06%     58.94%     100%
  cycles                              42.56%     57.44%     100%
  trap instructions per instruction   9.53%      3.53%      5.99%
  cost w/o trap instruction           6.44%      2.54%      4.20%
testCompiler
  instructions                        33.42%     66.58%     100%
  cycles                              34.07%     65.93%     100%
  trap instructions per instruction   9.38%      2.35%      4.70%
  cost w/o trap instruction           6.28%      1.62%      3.21%
testDecompiler
  instructions                        32.19%     67.81%     100%
  cycles                              32.38%     67.62%     100%
  trap instructions per instruction   9.31%      2.51%      4.70%
  cost w/o trap instruction           6.35%      1.73%      3.22%
testPrintDefinition
  instructions                        38.01%     61.99%     100%
  cycles                              38.09%     61.91%     100%
  trap instructions per instruction   9.35%      5.64%      7.05%
  cost w/o trap instruction           6.83%      4.13%      5.16%
testPrintHierarchy
  instructions                        26.25%     73.75%     100%
  cycles                              25.90%     74.10%     100%
  trap instructions per instruction   9.07%      4.22%      5.49%
  cost w/o trap instruction           6.48%      2.96%      3.87%
average of macro-benchmarks
  instructions                        34.19%     65.81%     100.00%
  cycles                              34.60%     65.40%     100.00%
  trap instructions per instruction   9.33%      3.65%      5.59%
  cost w/o trap instruction           6.48%      2.60%      3.93%


Table A.25: Trap frequencies, Part 1.

                            ST         system     both
classOrganizer
  cycles                    42.56%     57.44%     100%
  TT's per cycle            1.53%      0.00%      0.65%
  WO's per cycle            0.53%      0.05%      0.23%
  WU's per cycle            0.43%      0.13%      0.18%
  TI's per cycle            0.05%      0.00%      0.02%
  total traps per cycle     2.54%      0.18%      1.08%
compiler
  cycles                    34.07%     65.93%     100%
  TT's per cycle            0.91%      0.00%      0.31%
  WO's per cycle            0.56%      0.09%      0.19%
  WU's per cycle            0.51%      0.12%      0.17%
  TI's per cycle            0.24%      0.01%      0.08%
  GS's per cycle            0.00%      0.02%      0.00%
  total traps per cycle     2.22%      0.24%      0.76%
decompiler
  cycles                    32.38%     67.62%     100%
  TT's per cycle            0.92%      0.00%      0.30%
  WO's per cycle            0.34%      0.08%      0.11%
  WU's per cycle            0.37%      0.07%      0.12%
  TI's per cycle            0.34%      0.00%      0.11%
  total traps per cycle     1.98%      0.15%      0.64%
printDefinition
  cycles                    38.09%     61.91%     100%
  TT's per cycle            0.76%      0.00%      0.29%
  WO's per cycle            0.04%      0.02%      0.01%
  WU's per cycle            0.05%      0.02%      0.02%
  TI's per cycle            0.04%      0.00%      0.02%
  GS's per cycle            0.01%      0.00%      0.00%
  total traps per cycle     0.90%      0.03%      0.34%
printHierarchy
  cycles                    25.90%     74.10%     100%
  TT's per cycle            0.28%      0.00%      0.07%
  WO's per cycle            0.38%      0.03%      0.10%
  WU's per cycle            0.27%      0.07%      0.07%
  TI's per cycle            0.28%      0.00%      0.07%
  GS's per cycle            0.08%      0.00%      0.02%
  total traps per cycle     1.29%      0.10%      0.33%



Table A.25: Trap frequencies, Part 2.

                            ST         system     both
average of macro-benchmarks
  cycles                    0.00%      0.00%      100.00%
  TT's per cycle            0.88%      0.00%      0.32%
  WO's per cycle            0.37%      0.05%      0.13%
  WU's per cycle            0.33%      0.08%      0.11%
  TI's per cycle            0.19%      0.00%      0.06%
  GS's per cycle            0.02%      0.00%      0.00%
  total traps per cycle     1.79%      0.14%      0.63%

A.3.8. Evaluating the Performance Impact of Shadow Registers

To ascertain the time cost of omitting shadow registers from SOAR, we measured the
frequencies of the various types of traps, estimated the added cost of handling each type
without shadow registers, and multiplied the two together. One trap we could not measure
was the page fault trap. Handling a page fault takes so long, though, that the few cycles
saved by shadow registers will not make much difference. The traps we did include were:
integer tag traps (TT) on ALU and load/store instructions, register window overflows (WO)
on call instructions, register window underflows (WU) on return instructions, traps caused by
conditional trap instructions (TI), and Generation Scavenge traps (GS) on store instructions.
Of these, only tag and Generation Scavenge trap handlers profit from the shadow registers.
Table A.26 summarizes our results. These data seem to suggest that shadow registers do not
significantly improve performance. The maximum improvement is 0.12%.

Table A.26: Time cost of omitting shadow registers.
(All figures in percents.)

                            ST         system     both
testActivationReturn
  cycles                    95.91%     4.09%      100%
  shadow cost for GS        0%         0%         0%
  shadow cost for TT        0%         0%         0%
  shadow cost for both      0%         0%         0%
testClassOrganizer
  cycles                    42.56%     57.44%     100%
  shadow cost for GS        0.00%      0%         0.00%
  shadow cost for TT        0.12%      0%         0.05%
  shadow cost for both      0.12%      0%         0.05%
testCompiler
  cycles                    34.07%     65.93%     100%
  shadow cost for GS        0.00%      0.01%      0.00%
  shadow cost for TT        0.07%      0%         0.02%
  shadow cost for both      0.07%      0.01%      0.03%
testDecompiler
  cycles                    32.38%     67.62%     100%
  shadow cost for GS        0%         0%         0%
  shadow cost for TT        0.04%      0%         0.01%
  shadow cost for both      0.04%      0%         0.01%
testPrintDefinition
  cycles                    38.09%     61.91%     100%
  shadow cost for GS        0.00%      0%         0.00%
  shadow cost for TT        0.30%      0%         0.12%
  shadow cost for both      0.30%      0%         0.12%
testPrintHierarchy
  cycles                    25.90%     74.10%     100%
  shadow cost for GS        0.02%      0%         0.01%
  shadow cost for TT        0.02%      0%         0.00%
  shadow cost for both      0.04%      0%         0.01%
average of macro-benchmarks
  cycles                    34.60%     65.40%     100.00%
  shadow cost for GS        0.00%      0.00%      0.00%
  shadow cost for TT        0.11%      0.00%      0.04%
  shadow cost for both      0.11%      0.00%      0.04%
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AJ.9.  Docs  SOAR  Really  Need  Vectored  Traps? 

Suppose the reason for a trap appeared in the PSW register. Then, the instructions in
Table A.27 would simulate the effect of vectored traps. As the table shows, the cost would
be four more cycles per trap.

We can then estimate the overall performance impact by counting the number of traps
that occur (Table A.28). Since this would presumably allow us to shorten our traps by a
cycle, the table also lists the cost of the extra trap cycle in the current SOAR system. The
table indicates that the net effect of non-vectored traps would be a 2.2% time
penalty.
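
The net figure can be checked with a few lines of arithmetic. The sketch below is only an illustration: it takes the "both" column of the macro-benchmark average in Table A.28 and subtracts the cycle that non-vectored traps would save from the cost of the software dispatch. The variable names are ours, not the dissertation's.

```python
# Figures from the "both" column of the macro-benchmark average (Table A.28),
# all expressed as a percentage of total cycles.
EXTRA_CYCLES_PER_TRAP = 4        # software dispatch cost, per Table A.27
extra_trap_cycle_cost = 0.72     # cost of the extra trap cycle in current SOAR
nonvectored_cost = 2.89          # cost of dispatching traps in software

# Dropping vectoring would shorten every trap by one cycle, so that cost
# is credited back against the software dispatch cost.
net_penalty = nonvectored_cost - extra_trap_cycle_cost

# Sanity check: four extra cycles per trap should cost about four times
# what one extra cycle per trap costs.
scaled = EXTRA_CYCLES_PER_TRAP * extra_trap_cycle_cost

print(f"net penalty: {net_penalty:.2f}% of all cycles")
print(f"4 x single-cycle cost: {scaled:.2f}% (table: {nonvectored_cost:.2f}%)")
```

The difference rounds to the 2.2% quoted in the text.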


A.4.  Procedure  Calls 


Next  we  examine  SOAR's  features  that  help  procedure  calls. 


A.4.1.  Evaluating  SOAR’s  Register  File  Organization 

Unlike  other  RISCs,  the  chips  designed  at  Berkeley  feature  multiple  overlapping 
on-chip  register  windows.  These  reduce  the  amount  of  saving  and  restoring  for  calls  and 
returns.  If  this  feature  were  left  out  of  SOAR,  then  each  call  would  have  to  save  the  registers 
it  needed,  and  each  return  would  have  to  restore  the  saved  registers.  To  measure  this 
hypothetical cost, assuming no compiler optimization, we counted the number of non-nil
registers before each return instruction. This count of modified registers was then doubled to
account for both the saving and restoring cost. Finally, we added two cycles per return to


Table A.27: Simulating vectored traps.
%jump  — -—  ~  — ~ 

%extract psw, 2, r_temp

%ret t
(jump table)

Extra  Cost  4  cycles 


Table A.28: Time cost of non-vectored traps, Part 1.

                                          Smalltalk   System    both
testActivationReturn
  instructions                            97.21%      2.79%     100%
  time                                    95.91%      4.09%     100%
  traps per instruction                   0.30%       0.02%     0.29%
  cost of extra trap cycle/all cycles     0.22%       0.01%     0.21%
  cost of nonvectored traps/all cycles    0.89%       0.04%     0.85%
testClassOrganizer
  instructions                            41.06%      58.94%    100%
  time                                    42.56%      57.44%    100%
  traps per instruction                   3.75%       0.25%     1.69%
  cost of extra trap cycle/all cycles     2.54%       0.18%     1.18%
  cost of nonvectored traps/all cycles    10.14%      0.72%     [illegible]
testCompiler
  instructions                            33.42%      66.58%    100%
  time                                    34.07%      65.93%    100%
  traps per instruction                   3.31%       0.35%     1.34%
  cost of extra trap cycle/all cycles     2.22%       0.24%     0.92%
  cost of nonvectored traps/all cycles    8.88%       0.97%     3.66%
testDecompiler
  instructions                            32.19%      67.81%    100%
  time                                    32.38%      67.62%    100%
  traps per instruction                   2.90%       0.22%     1.08%
  cost of extra trap cycle/all cycles     1.98%       0.15%     0.74%
  cost of nonvectored traps/all cycles    7.90%       0.59%     2.96%
testPrintDefinition
  instructions                            38.01%      61.99%    100%
  time                                    38.09%      61.91%    100%
  traps per instruction                   1.23%       0.05%     0.50%
  cost of extra trap cycle/all cycles     0.90%       0.03%     0.36%
  cost of nonvectored traps/all cycles    3.60%       0.14%     1.46%
testPrintHierarchy
  instructions                            26.25%      73.75%    100%
  time                                    25.90%      74.10%    100%
  traps per instruction                   1.81%       0.15%     0.58%
  cost of extra trap cycle/all cycles     1.29%       0.10%     0.41%
  cost of nonvectored traps/all cycles    5.16%       0.42%     1.65%

Table A.28: Time cost of non-vectored traps, Part 2.

                                          Smalltalk   System    both
average of macro-benchmarks
  instructions                            34.19%      65.81%    100.00%
  time                                    34.60%      65.40%    100.00%
  traps per instruction                   2.60%       0.20%     1.04%
  cost of extra trap cycle/all cycles     1.79%       0.14%     0.72%
  cost of nonvectored traps/all cycles    7.14%       0.57%     2.89%

account  for  the  extra  cycle  of  the  loadm  and  storem  instructions.  Table  A.29  presents  these 
data.  SOAR’s  multiple  register  windows  are  the  most  significant  architectural  feature  on  the 
chip:  The  benchmarks  would  take  70%  more  time  without  them. 
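
The estimate described above (twice the number of live registers, plus two cycles, for every return) can be written as a small function. This is a sketch under stated assumptions: the function name is ours, the two inputs are the only legible pair of values that survives in Table A.29 (6.20% returns per cycle, 4.01 registers per return), and the result is a gross cost for that one row, so it is not expected to reproduce the 70% average quoted in the text.

```python
def window_removal_cost(returns_per_cycle, avg_regs_per_return):
    """Extra cycles, as a fraction of current cycles, if registers had to be
    saved on call and restored on return instead of using register windows."""
    # One cycle per register saved plus one per register restored, plus the
    # two extra cycles charged per return for the loadm and storem.
    cycles_per_return = 2 * avg_regs_per_return + 2
    return returns_per_cycle * cycles_per_return


cost = window_removal_cost(returns_per_cycle=0.0620, avg_regs_per_return=4.01)
print(f"gross cost of omitting register windows: {cost:.1%} of current cycles")
```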


How  much  would  the  image  expand  without  register  windows?  The  cost  would  be  two 
instructions  upon  entering  a  subroutine  (a  subtract  to  adjust  a  stack  pointer  and  a  storem  to 
save  registers),  and  two  instructions  for  each  return  from  the  routine  (a  loadm  to  restore  the 
registers  and  an  add  to  restore  the  sp).  Table  A.30  gives  our  analysis. 
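
The relative cost in Table A.30 follows from the counts of entry and exit points. The sketch below assumes 4-byte instruction words and reads "1500 kB" as 1,500,000 bytes; neither assumption is stated in this section, but together they reproduce the published figure.

```python
WORD_BYTES = 4              # assumed: one SOAR instruction occupies a 32-bit word
IMAGE_BYTES = 1500 * 1000   # assumed: "1500 kB" taken as 1,500,000 bytes

entry_points = 4654         # routine entry points (Table A.30)
exit_points = 6795          # routine exit points (Table A.30)

# Two extra instructions per entry (subtract + storem) and two per exit
# (loadm + add), as described in the text.
extra_words = 2 * entry_points + 2 * exit_points
relative_cost = extra_words * WORD_BYTES / IMAGE_BYTES

print(f"{extra_words} extra words, {relative_cost:.2%} of the image")
```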


A.4.2. Number of Registers per Window

With only eight registers, SOAR's windows are much smaller than RISC II's.
Measurements of Berkeley Smalltalk suggested that this would be sufficient. To verify this we
instrumented  our  system  and  ran  some  benchmarks.  When  more  registers  are  needed  for  a 
subroutine,  it  allocates  a  spill  area  in  main  memory.  Thus,  we  merely  counted  the  number  of 
spill  objects  allocated  and  divided  by  the  total  number  of  calls.  Also,  we  measured  how 
many  words  were  spilled  to  determine  how  many  more  registers  were  needed.  Table  A.31 
presents  these  data.  These  data  show  that  SOAR’s  windows  are  large  enough  for 
Smalltalk-80  programs;  more  than  97%  of  the  subroutines  called  fit  into  a  window. 
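
The derived rows of Table A.31 are simple ratios of the raw counts. The sketch below recomputes them; because the cycle and call totals are reported only as lower bounds (">1,100,000", ">18,000"), the ratios it prints are approximations, and the dictionary layout is our own.

```python
# Raw counts from Table A.31; cycle and call totals are lower bounds.
benchmarks = {
    "testCompiler":   {"cycles": 1_100_000, "calls": 18_000,
                       "spill_calls": 430,  "spill_words": 883},
    "testDecompiler": {"cycles": 2_900_000, "calls": 46_000,
                       "spill_calls": 1085, "spill_words": 2807},
}


def spill_stats(b):
    """Derive the three summary rows of Table A.31 from the raw counts."""
    return {
        "avg_words": b["spill_words"] / b["spill_calls"],
        "fraction_of_calls": b["spill_calls"] / b["calls"],
        "cycles_per_spill": b["cycles"] / b["spill_calls"],
    }


stats = {name: spill_stats(b) for name, b in benchmarks.items()}
for name, s in stats.items():
    print(f"{name}: {s['avg_words']:.1f} words per spill area, "
          f"{s['fraction_of_calls']:.1%} of calls spill, "
          f"one spill every {s['cycles_per_spill']:.0f} cycles")
```

In both benchmarks fewer than 3% of calls need a spill area, which is the basis for the claim that more than 97% of subroutines fit into a window.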


A.4.3. Analysis of Loadm & Storem


The  first  step  in  evaluating  the  impact  of  the  load-  and  store-  multiple  instructions  is  to 
measure  their  frequency.  Since  the  time  to  simulate  one  of  these  instructions  depends  on  the 


Table A.29: Analysis of r[...] (only fragments of this table are legible).

                                ST        SYS
testDecompiler
  instructions                  32.19%    67.81%
  cycles                        32.38%    67.62%
[following benchmark]
  instructions                  38.01%    61.99%

Legible row labels: cost of saving & restoring regs/all cycles; net cost of no reg file;
[...] vs full SOAR; retw's* / all insts; retw's* / cycles; avg regs used / retw*.

Legible values: 36.17%, 30.69%; retw's* / all insts 8.68%; retw's* / cycles 6.20%;
avg regs used / retw* 4.01.

Table A.30: Static analysis of register windows.

routine entry points    4654
routine exit points     6795
image size              1500 kB
relative cost           6.11%

Table A.31: Spill area analysis.

                                              testCompiler   testDecompiler
total number of cycles                        >1,100,000     >2,900,000
total number of Smalltalk calls               >18,000        >46,000
number of calls using spill area              430            1085
total size of spill areas actually needed     883            2807
avg. words of spill area used                 2.1            2.6
fraction of calls needing spill areas         2.3%           2.4%
mean number of cycles per spill allocation    2,600          2,700

number  of  registers  actually  accessed,  we  also  gathered  those  data  (Table  A.32).  The  loadm 
and  storem  instructions  rarely  occur,  only  one  in  130  instructions. 

Table A.33 shows the performance consequences of eliminating this seldom-used
feature. As expected from the frequency data, these instructions have minimal impact:
SOAR would be only 3% slower without them.


How  much  larger  would  the  compiled  image  grow  if  we  eliminated  loadm  and  storem? 
Originally, these instructions were intended only for the system code. In that case there
would be no significant static impact. However, our current strategy for spill areas requires
a routine that allocates a spill area to initialize it. We therefore instrumented our compiler to
count the number of words initialized this way (Table A.34). (We also subtracted out the
number of retn instructions used solely to write nil into several registers prior to the storem.)
Omitting  these  instructions  would  increase  the  size  of  the  system  by  only  2%. 


Table A.32: Loadm/storem execution frequencies, Part 1.

                                ST        SYS       both
testActivationReturn
  instructions                  97.21%    2.79%     100%
  loadms per instruction        0.00%     5.19%     0.14%
  loadms w/ 8 regs              0.00%     100.00%   100.00%
  mean loadm regs               0         8         8
  storems per instruction       0.00%     5.19%     0.14%
  storems w/ 8 regs             0.00%     100.00%   100.00%
  mean storem regs              0         8         8
testClassOrganizer
  instructions                  41.06%    58.94%    100%
  loadms per instruction        0.00%     0.62%     0.36%
  loadms w/ 8 regs              0.00%     100.00%   100.00%
  mean loadm regs               0         8         8
  storems per instruction       0.74%     0.65%     0.69%
  storems w/ 5 regs             0.00%     0.13%     0.07%
  storems w/ 6 regs             0.00%     0.00%     0.00%
  storems w/ 7 regs             100.00%   5.06%     46.89%
  storems w/ 8 regs             0.00%     94.81%    53.04%
  mean storem regs              7         7.95      7.53
testCompiler
  instructions                  33.42%    66.58%    100%
  loadms per instruction        0.00%     0.67%     0.45%
  loadms w/ 7 regs              0.00%     17.70%    17.70%
  loadms w/ 8 regs              0.00%     82.30%    82.30%
  mean loadm regs               0         7.82      7.82
  storems per instruction       0.75%     0.65%     0.69%
  storems w/ 4 regs             0.05%     0.00%     0.02%
  storems w/ 5 regs             0.85%     0.12%     0.39%
  storems w/ 6 regs             2.72%     0.00%     1.00%
  storems w/ 7 regs             96.38%    15.54%    45.21%
  storems w/ 8 regs             0.00%     84.33%    53.38%
  mean storem regs              6.95      7.84      7.52

Table A.32: Loadm/storem execution frequencies, Part 2.

                                ST        SYS       both
testDecompiler
  instructions                  32.19%    67.81%    100%
  loadms per instruction        0.00%     0.33%     0.24%
  loadms w/ 8 regs              0.00%     100.00%   100.00%
  mean loadm regs               0         8         8
  storems per instruction       0.73%     0.51%     0.58%
  storems w/ 4 regs             0.62%     0.00%     0.25%
  storems w/ 5 regs             0.00%     0.00%     0.00%
  storems w/ 6 regs             0.62%     0.00%     0.25%
  storems w/ 7 regs             98.76%    31.02%    58.35%
  storems w/ 8 regs             0.00%     68.98%    41.15%
  mean storem regs              6.98      7.69      7.40
testPrintDefinition
  instructions                  38.01%    61.99%    100%
  loadms per instruction        0.00%     0.06%     0.04%
  loadms w/ 8 regs              0.00%     100.00%   100.00%
  mean loadm regs               0         8.00      8.00
  storems per instruction       0.00%     0.14%     0.09%
  storems w/ 5 regs             0.00%     2.13%     2.13%
  storems w/ 6 regs             0.00%     0.00%     0.00%
  storems w/ 7 regs             0.00%     55.32%    55.32%
  storems w/ 8 regs             0.00%     42.55%    42.55%
  mean storem regs              0         7.38      7.38
testPrintHierarchy
  instructions                  26.25%    73.75%    100%
  loadms per instruction        0.00%     0.27%     0.20%
  loadms w/ 7 regs              0.00%     14.37%    14.37%
  loadms w/ 8 regs              0.00%     85.63%    85.63%
  mean loadm regs               0         7.86      7.86
  storems per instruction       0.24%     0.43%     0.38%
  storems w/ 5 regs             0.00%     4.53%     3.79%
  storems w/ 6 regs             0.00%     0.00%     0.00%
  storems w/ 7 regs             100.00%   41.51%    51.10%
  storems w/ 8 regs             0.00%     53.96%    45.11%
  mean storem regs              7         7.45      7.38

Table A.32: Loadm/storem execution frequencies, Part 3.

                                ST        SYS       both
avg of macros
  instructions                  34.19%    65.81%    100%
  loadms per instruction        0%        [...]     0.26%
  loadms w/ 7 regs              0%        6.41%     6.41%
  loadms w/ 8 regs              0%        93.59%    93.59%
  mean loadm regs               0         7.94      7.94
  storems per instruction       0.48% (only one value legible)
  storems w/ 4 regs             0.13%     0%        0.05%
  storems w/ 5 regs             0.17%     1.38%     1.28%
  storems w/ 6 regs             0.67%     0%        0.25%
  storems w/ 7 regs             79.03%    29.69%    51.37%
  storems w/ 8 regs             0%        68.93%    47.05%
  mean storem regs              5.59      7.66      7.44

Table A.33: Time cost of omitting loadm & storem.
(All costs in percents.)

benchmark                       ST        SYS       both
testActivationReturn
  cycles                        95.91%    4.09%     100%
  loadm cost/all cycles         0%        18.23%    0.75%
  storem cost/all cycles        0%        18.23%    0.75%
  total cost                    0%        36.47%    1.49%
testClassOrganizer
  cycles                        42.56%    57.44%    100%
  loadm cost                    0%        3.11%     1.79%
  storem cost                   2.99%     3.26%     3.14%
  total cost                    2.99%     6.37%     4.93%
testCompiler
  cycles                        34.07%    65.93%    100%
  loadm cost                    0%        3.15%     2.08%
  storem cost                   3.01%     3.08%     3.06%
  total cost                    3.01%     6.24%     5.14%
testDecompiler
  cycles                        32.38%    67.62%    100%
  loadm cost                    0%        1.71%     1.15%
  storem cost                   2.98%     2.37%     2.57%
  total cost                    2.98%     4.07%     3.72%
testPrintDefinition
  cycles                        38.09%    61.91%    100%
  loadm cost                    0%        0.30%     0.19%
  storem cost                   0%        0.65%     0.40%
  total cost                    0%        0.96%     0.59%
testPrintHierarchy
  cycles                        25.90%    74.10%    100%
  loadm cost                    0%        1.31%     0.97%
  storem cost                   1.02%     1.96%     1.72%
  total cost                    1.02%     3.28%     2.69%
macro avg.
  cycles                        34.60%    65.40%    100%
  loadm cost                    0%        1.92%     1.24%
  storem cost                   2%        2.26%     2.18%
  total cost                    2%        4.18%     3.41%

Table A.34: Raw data for static analysis of store multiple.

description                count
cost for storem            7363 words
total SOAR image size      1500 kB
relative static cost       1.96%


A.4.4.  Performance  of  Inline  Caching 


First,  we  measured  the  cost  of  SOAR's  in-line  cache.  In  other  words,  if  no  procedure 
lookups  were  needed,  how  much  faster  could  SOAR  run?  To  evaluate  SOAR’s  in-line 
cache,  we  counted  the  occurrences  of  the  cache  probe  conditional  trap  instruction.  That  gave 
us the number of probes. Then, since the prologue takes five cycles, we can easily get the
probe time. For the misses, we added two components: the miss trap handler time, obtained
by multiplying the number of misses (trap instruction traps) by the trap handler path length,
and the lookup time, obtained directly from an execution profile. Table A.35 summarizes
these data, which show that in-line caching takes a lot of time; 23% of SOAR’s time is spent
testing the cache and handling misses. Without any caching at all, the probe time would
decrease to zero, but the miss time would increase by a factor of 1/3.53% = 28. In other words,
what takes 100 seconds with in-line caching would take 100 - 10.88 + 12.46 x 28 = 438 seconds.
SOAR  would  be  four  times  slower  with  no  cache  at  all. 
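
The no-cache estimate can be replayed step by step. The sketch below follows the arithmetic exactly as the text states it, including rounding the growth factor to 28; the inputs are the "both" column of the macro-benchmark average in Table A.35, and the variable names are ours.

```python
# "both" column, average of macro-benchmarks (Table A.35).
miss_ratio = 0.0353   # misses per probe
probe_time = 10.88    # % of time spent in probes and probe trap handling
miss_time = 12.46     # % of time spent handling misses

# With no cache every send misses, so miss time grows by 1 / miss_ratio.
growth = round(1 / miss_ratio)

# Arithmetic as given in the text: remove the probe time, scale the miss time.
no_cache_time = 100 - probe_time + miss_time * growth
slowdown = no_cache_time / 100

print(f"growth factor: {growth}")
print(f"time without any cache: {no_cache_time:.0f} (versus 100 with the in-line cache)")
print(f"slowdown: about {slowdown:.1f}x")
```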

Next,  we  compared  the  23%  cost  for  the  in-line  cache  with  other  caching  schemes. 
One of these was the hash table cache found in interpretive Smalltalk-80 systems. The other
scheme was an in-line indirect cache. Each call would jump through a per-process area with
each process's cache entries. Table A.36 shows the code sequences needed for these two
types of cache. The hash table cache is the most expensive scheme, requiring 23 cycles for a
cache probe. SOAR’s in-line cache requires a prologue of only 5 cycles. The indirect
scheme  adds  a  cycle  for  the  indirect  call  and  one  for  an  indirect  load  in  the  prologue  for  a 
total  of  7. 

Assuming  that  the  cache  miss  cost  is  independent  of  the  caching  scheme,  we  can  use 
the  cache  probe  frequency  data  to  calculate  the  costs  of  these  caching  schemes  (Table  A.37). 
The  bottom  line  in  the  table  gives  the  average  speed  of  the  various  schemes.  SOAR  would 
run only 75% as fast as it does now with a conventional hash table cache. In other words,
the  work  that  requires  100  cycles  would  take  133  with  a  conventional  cache. 
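
One way to read the entries of Table A.37 is as the current running time divided by the running time after every probe is charged the extra cycles of the alternative scheme. The function below is a sketch of that reading, not a statement of how the table was actually produced; it is checked against the testActivationReturn row, where probes occur in 6.77% of cycles and there are no misses.

```python
SOAR_PROBE_CYCLES = 5   # prologue cost of SOAR's in-line cache


def relative_speed(probes_per_cycle, probe_cycles):
    """Speed relative to current SOAR (1.0 = same) if each probe took
    probe_cycles instead of SOAR_PROBE_CYCLES, with miss cost unchanged."""
    extra_time = probes_per_cycle * (probe_cycles - SOAR_PROBE_CYCLES)
    return 1.0 / (1.0 + extra_time)


# testActivationReturn with a 23-cycle hash table probe.
hash_speed = relative_speed(probes_per_cycle=0.0677, probe_cycles=23)
print(f"hash table cache: {hash_speed:.2%} of current speed")
```

The result agrees with the 45.06% entry for that benchmark to within rounding.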


Table A.35: Inline cache performance evaluation, Part 1.

description  ST  system  both

testActivationReturn


instructions 

cycles 


probes  per  inst 
probes  per  cycle 
loadc  traps  per  probe 


probe  insts  per  inst 
loadc  trapH  insts  per  inst 
be  &  trapH  insts  per  inst 


probe  cycles  per  cycle 
loadc  trapH  cycles  per  cycle 
miss  trapH  cycles  per  cycle 


probe  &  trapH  cycles  per  cycle 
total  miss  time 


total  cache  time 


97.21  % 
95.91  % 


9.47% 

7.06% 

0% 

0% 


28.40% 

0% 

28.40% 


35.32% 

0% 


2.79% 

4.09% 


1% 
0.01% 
0% 
0% 


0.03% 

0-0% 

0.03-0.03% 


0.03% 

0-0% 


35.32%  0.03-0.03% 


100% 

100% 


9.20% 

6.77% 


27.61% 

0-0% 

27.61-27.61% 


33.87% 

0-0% 

0% 


33.87-33.87% 

0% 


33.87-33.87% 


probes  per  inst 
probes  per  cycle 
loadc  traps  per  probe 
misses  per  probe 


testClassOrganizer 


|j  7.24% 
|j  4.90% 
H  25.39% 
0.96% 


58.94% 

57.44% 


0.05% 

0.04% 

0-0% 

0% 


probe  cycles  per  cycle 
loadc  trapH  cycles  per  cycle 
miss  trapH  cycles  per  cycle 

24.48%  0.18% 

8.70%  0-0% 

0.14%  0% 

probe  &  trapH  cycles  per  cycle 
total  miss  time 

33.18%  0.18-0.18% 

total  cache  tune 

100% 

100% 


3.00% 

2.10% 

25.15-25.15% 

0.95% 


9.01% 

2.27-2.27% 

11.27-11.27% 


10.52% 

3.70-3.70% 

0.06% 


14.22-14.22% 

2.66% 


16.88-16.88% 


Table A.35: Inline cache performance evaluation, Part 3.

description                     ST        system    both
testPrintDefinition
  instructions                  38.01%    61.99%    100%
  cycles                        38.09%    61.91%    100%

probes  per  inst 
probes  per  cycle 
loadc  traps  per  probe 


7.98% 

5.83% 

1.03% 


0.73% 


3.06% 

2.24% 

1.02-1.02% 

0.72% 


9.18% 

0.09-0.09% 

9.27-9.27% 


1121% 

0.16-0.16% 

0.05% 


probe  &.  trapH  cycles  per  cycle 

|i  29.59% 

0.15-0.15% 

11.37-11.37% 

total  miss  time 

it 

l> 

1.95% 

total  cache  time 

13.31-13.31% 

                                  ST        system       both
testPrintHierarchy
  instructions                    26.25%    73.75%       100%
  cycles                          25.90%    74.10%       100%
  probes per inst                 7.62%     0.16%        2.12%
  probes per cycle                5.44%     0.11%        1.49%
  loadc traps per probe           4.47%     0-0%         4.22-4.22%
  misses per probe                5.13%     0%           4.84%
  probe insts per inst            22.86%    0.48%        6.36%
  loadc trapH insts per inst      1.02%     0-0%         0.27-0.27%
  probe & trapH insts per inst    23.88%    0.48-0.48%   6.62-6.62%
  probe cycles per cycle          27.20%    0.56%        7.46%
  loadc trapH cycles per cycle    1.70%     0-0%         0.44-0.44%
  miss trapH cycles per cycle     0.84%     0%           0.22%
  probe & trapH cycles per cycle  28.90%    0.56-0.56%   7.90-7.90%
  total miss time                                        18.52%
  total cache time                                       26.42-26.42%

Table A.35: Inline cache performance evaluation, Part 4.

description                       ST        system       both
average of macro-benchmarks
  instructions                    34.19%    65.81%       100.00%
  cycles                          34.60%    65.40%       100%
  probes per inst                 7.47%     0.13%        2.64%
  probes per cycle                5.1[..]   0.09%        1.86%
  loadc traps per probe           2.6[..]   0.00-0.43%   12.21-12.23%
  misses per probe                3.7[..]   0.00%        3.53%
  probe insts per inst            22.40%    0.40%        7.93%
  loadc trapH insts per inst      2.77%     0.00%        0.99%
  probe & trapH insts per inst    25.17%    0.40%        8.91-8.92%
  probe cycles per cycle          25.96%    0.46%        9.28%
  loadc trapH cycles per cycle    4.39%     0.00%        1.60%
  probe & trapH cycles per cycle  30.35%    0.46-0.47%   10.88%
  total miss time                                        12.46%
  total cache time                                       23.34%


Table A.36: Code sequences for various caches.

Hash-table Cache

%loadc   (r14)classOffset, r6%
%load    (r15)0, r5; sel
%xor     r5, r6, r4%
%load    pcRel(mask), r3%
%and     r3, r4, r4%
%sla     r4, r4%
%sla     r4, r4%
%load    pcRel(base), r3%
%add     r3, r4, r4%
%load    (r4)cacheClass, r3%
%trap3   ne r3, r4%
%load    (r4)cacheSel, r3%
%trap3   ne r3, r4%
%load    (r4)cacheTarget, r3%
%ret     r3, 0%

Time cost: 23 cycles

Indirect Inline Cache

<indirect call>
%loadc   (r14)classOffset, r6%
%load    (r15)0, r5%
%load    (r5)rCacheBase, r5; uses global OR mapping
%trap3   ne r5, r6%

Time cost: 7 cycles + 1 cycle for indirect call

SOAR Inline Cache

(r14)classOffset, r6%


Table A.37: Relative performance of various caching schemes.
(SOAR = 100%, faster is better.)

                       no        hash      indirect   SOAR     zero time
                       cache     table     inline     cache    resolution
testActivationReturn   151.23%   45.06%    83.04%     100%     151.23%
testClassOrganizer     28.13%    72.53%    91.86%     100%     120.23%
testCompiler           25.05%    76.10%    93.03%     100%     134.11%
testDecompiler         23.35%    76.57%    93.28%     100%     151.74%
testPrintDefinition    28.58%    71.26%    91.67%     100%     115.30%
testPrintHierarchy     22.14%    78.82%    94.19%     100%     135.51%
average                25.45%    75.06%    92.81%     100%     131.38%

Next  we  examine  the  space  impact  of  these  caching  strategies.  Table  A.38  presents  the 
raw  data  we  have  collected  from  the  compiler.  The  total  space  taken  by  SOAR’s  in-line 
caching  scheme  is  the  sum  of  the  number  of  extra  words  needed  to  hold  the  last  class  for  the 
sends (measured by the number of cache slots), and the space consumed by the method
prologues. The number of prologues is the same as the number of cache probes. Table A.39
illustrates this prologue. Table A.40 below shows the amounts of overhead at the call site
and at the method prologue for the various caching schemes. Finally, we can combine these
data to show the impact that each scheme would have (Table A.41). Thus, the hash table
cache would save 1.24% of the image space.
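
The combination of Tables A.38 and A.40 into Table A.41 can be sketched as follows. The units of the overhead figures in Table A.40 are not stated in this section; the published percentages are reproduced when each unit of overhead is counted directly against an image of 1,500,000 units, and that is the assumption made here. The names below are ours.

```python
CALL_SITES = 22025          # Table A.38
PROLOGUES = 4654            # Table A.38 (cache probes, one per method prologue)
IMAGE_UNITS = 1500 * 1000   # assumed: "1500 kB" taken as 1,500,000 units

# (overhead per call site, overhead per prologue), from Table A.40.
overhead = {
    "no lookups": (0, 0),
    "in-line cache": (1, 4),
    "indirect in-line cache": (3, 4),
    "hash table": (1, 0),
}


def total_overhead(scheme):
    per_site, per_prologue = overhead[scheme]
    return per_site * CALL_SITES + per_prologue * PROLOGUES


# Impact relative to SOAR's in-line cache; negative values are savings.
baseline = total_overhead("in-line cache")
impact = {name: (total_overhead(name) - baseline) / IMAGE_UNITS
          for name in overhead}

for name, value in impact.items():
    kind = "cost" if value > 0 else "savings"
    print(f"{name:24s} {abs(value):.2%} {kind}")
```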


Table A.38: Raw data for static analysis of caching.

call sites      22025
cache probes    4654
image size      1500 kB

Table A.39: Inline cache prologue.

<selector>            needed to handle misses
%loadc (r14)0, r0     get receiver’s class
%load  (r15)0, r1     get last class for send
%trap1 ne r0, r1      verify cache

Table A.40: Space overhead for the various caching schemes.

                          call site overhead   prologue overhead
no lookups                0                    0
in-line cache             1                    4
indirect in-line cache    3                    4
hash table                1                    0

Table A.41: Net space impact of caching schemes.

no lookups                2.71% savings
in-line cache             0
indirect in-line cache    2.94% cost
hash table                1.24% savings


A.4.5.  How  Fast  Does  SOAR  Shuffle? 

SOAR is a nimble processor; jumps and branches take only one cycle. To understand
the significance of this feature, we can examine the frequency of jumps and calls (Table
A.42). As the table shows, jumps and calls are popular instructions; one instruction in 10 is
a jump and one in 17 is a call. Given the frequency data, we can add the extra cycle SOAR
would require without a fast shuffle (Table A.43). These data show that SOAR would be
11% slower without the fast shuffle mechanism.
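
The 11% figure is consistent with the mean of the per-benchmark totals in the "both" column of Table A.43. The sketch below averages the five benchmarks other than testActivationReturn, which the averages in this appendix appear to exclude; that the quoted figure was obtained exactly this way is an assumption.

```python
# "total cost" row, "both" column of Table A.43 (percent of all cycles).
total_cost_both = {
    "testClassOrganizer": 12.69,
    "testCompiler": 11.32,
    "testDecompiler": 10.67,
    "testPrintDefinition": 10.69,
    "testPrintHierarchy": 9.55,
}

average_cost = sum(total_cost_both.values()) / len(total_cost_both)
print(f"average cost of omitting fast shuffle: {average_cost:.2f}% of all cycles")
```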

A.4.6. Evaluation of Parallel Register Initialization

If the return instruction could not write nil into six registers at once, each routine would
have to write nil into its temporary variable registers sequentially. Using the Benchmark
column of [Bla83a], page 139, one can compute an average of 1.19 arguments and temporaries per
call, excluding the receiver. Since the average number of arguments per call is 0.88
[MeC83] (p. 185, Fig. 10.3), we assume that the average number of temporaries per call is
between zero and one. This gives the number of extra cycles required per call. To measure
the number of calls requiring nilling, we used the number of return instructions that changed
the window. This way, we also included returns from interrupts. Table A.44 presents our
measurement of the extra time that serial instead of parallel nilling would take, assuming no


Table A.42: Frequency of jump and call instructions.

                       ST        SYS       both
testActivationReturn
  instructions         97.21%    2.79%     100%
  jumps                5.03%     10.50%    5.18%
  calls                9.62%     0.08%     9.35%
  jumps & calls        14.65%    10.58%    14.53%
testClassOrganizer
  instructions         41.06%    58.94%    100%
  jumps                15.10%    8.96%     11.48%
  calls                14.51%    1.14%     6.63%
  jumps & calls        29.62%    10.10%    18.11%
testCompiler
  instructions         33.42%    66.58%    100%
  jumps                14.25%    8.95%     10.72%
  calls                13.74%    1.89%     5.85%
  jumps & calls        27.99%    10.84%    16.57%
testDecompiler
  instructions         32.19%    67.81%    100%
  jumps                12.91%    8.66%     10.03%
  calls                13.23%    1.88%     5.54%
  jumps & calls        26.14%    10.55%    15.57%
testPrintDefinition
  instructions         38.01%    61.99%    100%
  jumps                12.84%    5.51%     [illegible]
  calls                13.50%    1.89%     [illegible]
  jumps & calls        26.34%    7.40%     [illegible]
testPrintHierarchy
  instructions         26.25%    73.75%    100%
  jumps                12.41%    7.85%     9.04%
  calls                13.73%    1.23%     4.51%
  jumps & calls        26.14%    9.07%     13.55%
average of macros
  instructions         34.19%    65.81%    100%
  jumps                13.50%    7.99%     9.91%
  calls                13.74%    1.61%     5.77%
  jumps & calls        [illegible]


Table A.43: Cost of omitting fast shuffle.

                       ST        system    both
testActivationReturn
  cycles               95.91%    4.09%     100%
  jump cost            3.73%     5.27%     3.82%
  call cost            7.17%     0.04%     6.88%
  total cost           10.93%    5.31%     10.70%
testClassOrganizer
  cycles               42.56%    57.44%    100%
  jump cost            10.21%    6.44%     8.05%
  call cost            9.81%     0.82%     4.65%
  total cost           20.02%    7.26%     12.69%
testCompiler
  cycles               34.07%    65.93%    100%
  jump cost            9.55%     6.18%     7.32%
  call cost            9.21%     1.30%     4.00%
  total cost           18.76%    7.48%     11.32%
testDecompiler
  cycles               32.38%    67.62%    100%
  jump cost            8.80%     5.96%     6.88%
  call cost            9.02%     1.30%     3.80%
  total cost           17.82%    7.25%     10.67%
testPrintDefinition
  cycles               38.09%    61.91%
  jump cost            9.38%     4.04%     6.07%
  call cost            9.87%     1.38%     4.61%
  total cost           19.25%    5.42%     10.69%
testPrintHierarchy
  cycles               25.90%    74.10%    100%
  jump cost            8.86%     5.50%     6.37%
  call cost            9.80%     0.86%     3.18%
  total cost           18.66%    6.36%     9.55%
average of macro-benchmarks
  cycles               34.60%    65.40%    100%
  call cost            9.54%
  jump cost            9.36%
  total cost           18.90%

changes  in  compiler  strategy.  The  data  show  that  SOAR  would  run  4%  slower  without 
parallel  nilling. 

To analyze the impact of parallel nilling on the size of the compiled image, we
instrumented our compiler (Table A.45). To do this, we kept a running total of the number of
temporary variables that would be kept in registers. Assuming that each variable would


Table A.44: Evaluation of parallel nilling, Part 2.

                                             ST            system    both
testPrintHierarchy
  avg. regs containing pointers per retw*    n.a.
  avg temp vars                              0-1
  retw's* per inst                           8.68%
  retw's* per cycle                          6.20%
  cost of nilling                            0%-6.20%      4.42%     3.28%-4.89%
average of macro-benchmarks
  instructions                               34.19%        65.81%
  cycles                                     34.60%        65.40%    100.00%
  avg. regs containing pointers per retw*    n.a.          1.80      n.a.
  avg temp vars                              0-1           n.a.      n.a.
  retw's* per inst                           9.01%         4.07%     5.73%
  retw's* per cycle                          6.24%         2.89%     4.02%
  cost of nilling                            0.00%-6.25%   5.05%     3.27%-5.44%


require an additional instruction to nil it, we can then compute the space overhead nilling
would require without hardware support. The table shows that our image would be 1.29%
larger if SOAR lacked this feature.
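
The static cost in Table A.45 can be recomputed from the two instruction counts. As a sketch, this assumes one 4-byte word per nilling instruction and reads "1500 kB" as 1,500,000 bytes; both assumptions are ours, chosen because they reproduce the table.

```python
WORD_BYTES = 4              # assumed: one instruction per 32-bit word
IMAGE_BYTES = 1500 * 1000   # assumed: "1500 kB" taken as 1,500,000 bytes

nil_temp_instructions = 2348    # nilling cost for temporary variables
nil_spill_instructions = 2472   # nilling cost for spill initialization

temp_cost = nil_temp_instructions * WORD_BYTES / IMAGE_BYTES
spill_cost = nil_spill_instructions * WORD_BYTES / IMAGE_BYTES
serial_nilling_cost = temp_cost + spill_cost

print(f"temps {temp_cost:.2%}, spill areas {spill_cost:.2%}, "
      f"total {serial_nilling_cost:.2%}")
```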

A.4.7.  Return  Options 

The inclusion of three optional operations in SOAR’s return instruction adds some
complexity to the architecture. Which of the possible combinations are really used? Table A.46
shows our dynamic frequency data. As expected, the normal return, retnw, was used nearly
three quarters of the time. Although seven out of the eight possible versions were actually
used, only ret, reti, retw, and retnw are essential; the rest could be omitted. The other 10%


Table A.45: Static analysis of parallel nilling.

nilling cost for temporary variables       2348
nilling cost for spill initialization      2472
total SOAR image size                      1500 kB
relative static cost to nil temps          0.63%
relative static cost to nil spill obj.     0.66%
total static cost for serial nilling       1.29%

Table A.46: Dynamic frequency of return options, Part 1.

testActivationReturn
  returns per instruction    9.78%
  returns per cycle          7.20%
  %reti's per return         1.48%
  %retn's per return         1.48%
  %retnw's per return        0.01%
  retnw's per return         95.54%
  %retiw's per return        1.48%
testClassOrganizer
  returns per instruction    8.03%
  returns per cycle          6.46%
  %ret's per return          4.72%
  %reti's per return         12.59%
  %retn's per return         5.90%
  retn's per return          0.03%
  %retw's per return         2.26%
  retw's per return          0.48%
  %retnw's per return        11.92%
  retnw's per return         58.20%
  %retiw's per return        3.90%
testCompiler
  returns per instruction    8.18%
  returns per cycle          5.59%
  %ret's per return          3.91%
  %reti's per return         11.78%
  %retn's per return         9.24%
  retn's per return          0.13%
  %retw's per return         1.58%
  retw's per return          0.53%
  %retnw's per return        16.07%
  retnw's per return         52.16%
  %retiw's per return        4.48%
  %retinw's per return       0.12%

Table A.46: Dynamic frequency of return options, Part 2.

testDecompiler
  returns per instruction    7.38%
  returns per cycle          5.06%
  %ret's per return          4.73%
  %reti's per return         11.37%
  %retn's per return         8.77%
  retn's per return          0.36%
  %retw's per return         0.55%
  retw's per return          0.02%
  %retnw's per return        13.33%
  retnw's per return         57.61%
  %retiw's per return        3.26%
testPrintDefinition
  returns per instruction    7.84%
  returns per cycle          5.74%
  %ret's per return          8.45%
  %reti's per return         5.87%
  %retn's per return         1.90%
  %retw's per return         4.74%
  %retnw's per return        11.48%
  retnw's per return         67.08%
  %retiw's per return        0.47%
testPrintHierarchy
  returns per instruction    5.68%
  returns per cycle          4.00%
  %ret's per return          5.29%
  %reti's per return         7.18%
  %retn's per return         7.76%
  retn's per return          0.17%
  %retw's per return         1.02%
  %retnw's per return        12.84%
  retnw's per return         62.64%
  %retiw's per return        3.04%
  %retinw's per return       0.06%

of the returns would just require an extra cycle or two to synthesize. Since a return
occurs only about once in twenty cycles, the effect would be to add a cycle or two every 200
cycles. This would degrade performance less than 1%.
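
The bound above is a product of three rough quantities taken from the text. The sketch below multiplies them out for both the one-cycle and the two-cycle case; the variable names are ours, and the inputs are the text's round numbers rather than measured values.

```python
returns_per_cycle = 1 / 20     # "about once in twenty cycles"
nonessential_fraction = 0.10   # share of returns using a non-essential variant

# Extra cycles per existing cycle if each such return needs 1 or 2 more cycles.
penalties = [returns_per_cycle * nonessential_fraction * extra_cycles
             for extra_cycles in (1, 2)]

for extra_cycles, penalty in zip((1, 2), penalties):
    print(f"{extra_cycles} extra cycle(s): {penalty:.1%} slower")
```

With these round inputs the two-cycle case comes to about 1%, and the one-cycle case to half that.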

AJ.  Storage  Management 

This  section  contains  an  evaluation  of  SOAR’s  features  to  help  manage  storage. 

Ai.l.  Evaluation  of  the  Generation  Scavenge  Tag  Checking  Hardware 


ha 

Si 

i 
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The  second  step  is  to  examine  the  cost  of  doing  the  check  in  software  (Table  A.48):  simulat¬ 
ing  this  feature  takes  four  cycles.  The  number  of  tagged  stores  executed  per  cycle  can  then 
be  multiplied  by  the  simulation  cost  (Table  A.49).  Hie  result  of  this  calculation  is  that  the 
worst-case  macro-benchmark  would  run  only  3%  slower  without  this  feature. 

Next we examine the space cost of eliminating the generation tag checking hardware.
Table A.50 gives the static frequency of these store instructions. As expected from the rarity
of execution, tagged stores account for very little of the code, or about 2%.

Finally, we multiply the 3 word space penalty by the static frequency (Table A.51) to
compute that the Smalltalk-80 image would grow by only 3% if tagged stores were removed
from SOAR.
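
The time-cost calculation described above (tagged stores per cycle multiplied by the four-cycle software check) can be sketched as follows. This is an illustration under stated assumptions, not the dissertation's own calculation: it assumes that the unprefixed `store` rows of the Appendix B instruction mixes count the tagged stores, and it uses the "both" columns of Tables B.6, B.7, and B.9. The names `BENCHMARKS` and `slowdown_percent` are hypothetical.

```python
SOFTWARE_CHECK_CYCLES = 4  # cycles to simulate the check in software (Table A.48)

# name: (tagged stores executed, total cycles), taken from Appendix B.
# Assumption: the unprefixed "store" row counts the tagged stores.
BENCHMARKS = {
    "testCompiler": (6449, 1088758),
    "testDecompiler": (10201, 2893596),
    "testPrintHierarchy": (87, 117585),
}


def slowdown_percent(tagged_stores, cycles, check_cycles=SOFTWARE_CHECK_CYCLES):
    """Extra cycles spent in software checks, as a percentage of all cycles."""
    return 100.0 * check_cycles * tagged_stores / cycles


for name, (stores, cycles) in BENCHMARKS.items():
    print(f"{name}: {slowdown_percent(stores, cycles):.2f}% slower")
```

For these three benchmarks the estimate stays below the 3% worst case quoted in the text.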


Table A.48: Workaround for tagged stores.

%store  (a)i, b
%and    a, 0xf << 28, ta
%and    b, 0xf << 28, tb
%trap   lt ta, tb       ; trap if a younger
%trap   eq ta, 0xf      ; trap if a is a context

dynamic cost    4 cycles
static cost     4 words

Table A.49: Time cost of omitting GS Tag Trap Store.
(Given as % of total cycles.)

Columns: benchmark; all cycles (ST, system, both); store cost cycles (ST, system, both).


A.5.2. Frequency of GS traps


One last interesting measurement is the cost of the Generation Scavenging trap. Table
A.52 gives the frequency of store traps. These data indicate that only 3.9% of the tagged
stores trap. Since the path length for the store trap handler is 40 cycles (including the code
to remember the object), the time spent handling these traps is

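\[
40\,\frac{\text{cycles}}{\text{trap}}
\times 3.9\%\,\frac{\text{traps}}{\text{tagged store}}
\times 0.36\%\,\frac{\text{tagged stores}}{\text{instruction}}
\times \frac{\text{instruction}}{1.5\ \text{cycles}}
\]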

The  time  for  store  traps  is  insignificant. 
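
A per-benchmark version of this estimate can be sketched from the raw counts in Appendix B. The code below is an illustration only: it assumes that the `GS store` rows count the store traps and that the unprefixed `store` rows count the tagged stores (both taken from the "both" columns of Tables B.6 and B.9), and it charges the 40-cycle handler path length to every trap. The names `BENCHMARKS`, `trap_fraction`, and `trap_time` are hypothetical.

```python
HANDLER_CYCLES = 40  # path length of the store trap handler, from the text

# name: (GS store traps, tagged stores executed, total cycles)
BENCHMARKS = {
    "testCompiler": (175, 6449, 1088758),
    "testPrintHierarchy": (12, 87, 117585),
}


def trap_fraction(traps, tagged_stores):
    """Percentage of tagged stores that trap (the quantity in Table A.52)."""
    return 100.0 * traps / tagged_stores


def trap_time(traps, cycles, handler_cycles=HANDLER_CYCLES):
    """Percentage of all cycles spent in the store trap handler."""
    return 100.0 * handler_cycles * traps / cycles


for name, (traps, stores, cycles) in BENCHMARKS.items():
    print(f"{name}: {trap_fraction(traps, stores):.2f}% of tagged stores trap, "
          f"{trap_time(traps, cycles):.2f}% of cycles in the handler")
```

The trap fractions reproduce the "both" column of Table A.52 for these two benchmarks (2.71% and 13.79%), and the handler time stays well under 1% of all cycles in both cases.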

A.5.3. Evaluating the Pointer to Register Support

The pointer-to-register circuitry includes a comparator and a significant amount of control
complexity [Pen85b]. How well could SOAR get along without it? There are two cases
to analyze:

thisContext


In Smalltalk-80, a routine can request a pointer to its activation record by accessing the
pseudo-variable thisContext. In this case, the compiler must give out an illegal
(unmapped) address. When the program tries to use this address, the page fault
handler can then ensure the activation record resides in memory and not on-chip, then
complete the operation. Fortunately, this case mostly occurs in the debugger, where a
speed penalty is more acceptable.

Table A.52: Dynamic frequency of tagged store GS traps.
(Given as percentage of ST, system, both tagged stores executed.)

benchmark               ST        system    both
testPopStoreInstVar     0%        0%        0%
testClassOrganizer      0.30%     0%        0.24%
testCompiler            0.24%     4.83%     2.71%
testDecompiler          0%        0%        0%
testPrintDefinition     2.63%     0%        2.63%
testPrintHierarchy      21.05%    0%        13.79%
avg macros              4.84%     0.97%     3.87%

blockCopy

A Smalltalk-80 block permits execution of a piece of code in one procedure to be controlled
by another procedure. We implement this feature with a distinct activation
record that contains a pointer to the defining activation record. Thus, the code in a
block can access the data in its home activation record with loads and stores. If we
eliminate the pointer-to-register circuitry from SOAR, we merely need to flush a
block's home activation record out to memory when entering the block. This may
involve flushing extra register windows until we reach the desired one. On the other
hand, the desired window may already be in memory. We ran the benchmarks and
simulated the cost of this scheme. Every time control entered a block, we counted the
number of windows that would have to be flushed. The first column of Table A.53
gives the number of block invocations, and the second gives the average number of
windows flushed per invocation. We have assumed an 18 cycle cost to flush a window:
nine cycles to save it, and another nine to restore it. This estimate is probably
low since it omits the cost of handling the extra traps. The third column, which is the
cycles spent flushing windows per invocation, is just 18 times the second. The next
two columns give the frequency of block invocations per cycle in compiled Smalltalk
code, and the cost of simulating pointer-to-register per cycle in compiled Smalltalk
code. Finally, the last two columns give the same data, but relative to the total time,
not just the time executing compiled code. These data show that SOAR would be only
3% slower without the pointer-to-register feature.

Table A.53: Time cost of eliminating pointer-to-register hardware.

                   block    windows/  cycles/  values/   cost/     values/  cost/
benchmark          invoks   invok     invok    ST cycle  ST cycle  cycle    cycle
classOrganizer     4023     0.92      16.6     0.29%     4.89%     0.13%    2.08%
compiler           906      0.50      9.0      0.24%     2.20%     0.08%    0.75%
decompiler         2785     1.40      25.2     0.30%     7.49%     0.10%    2.43%
printDefinition    149      2.02      36.4     0.53%     19.2%     0.20%    7.31%
printHierarchy     152      1.30      23.4     0.50%     11.68%    0.13%    3.02%
average            1603     1.23      22.1     0.37%     9.09%     0.13%    3.12%
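
The arithmetic behind Table A.53 can be sketched as below. This is an illustration, not the simulator code: it takes the windows flushed per invocation and the block invocations per cycle as printed in the table, applies the assumed 18-cycle flush cost, and recomputes the other columns. Because the printed frequencies are rounded to two decimals, the recomputed cost per cycle only approximates the printed one. The names `ROWS`, `cycles_per_invocation`, and `cost_per_cycle` are introduced for this example.

```python
FLUSH_CYCLES_PER_WINDOW = 18  # nine cycles to save a window plus nine to restore it

# (benchmark, windows flushed per invocation,
#  block invocations per cycle in %, printed cost per cycle in %)
ROWS = [
    ("classOrganizer", 0.92, 0.13, 2.08),
    ("compiler", 0.50, 0.08, 0.75),
    ("decompiler", 1.40, 0.10, 2.43),
    ("printDefinition", 2.02, 0.20, 7.31),
    ("printHierarchy", 1.30, 0.13, 3.02),
]


def cycles_per_invocation(windows):
    """Cycles spent flushing windows each time a block is entered."""
    return FLUSH_CYCLES_PER_WINDOW * windows


def cost_per_cycle(values_per_cycle_pct, windows):
    """Flush cycles as a percentage of all cycles."""
    return values_per_cycle_pct * cycles_per_invocation(windows)


for name, windows, values_pct, printed in ROWS:
    print(f"{name}: {cycles_per_invocation(windows):.1f} cycles/invok, "
          f"cost {cost_per_cycle(values_pct, windows):.2f}% (table: {printed}%)")
```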


A.6.  Implementation 

We  have  examined  two  implementation-related  issues:  eliminating  register  forwarding 
and  the  relative  proportions  of  data-  and  instruction-fetches. 

A.6.1.  Register  Forwarding 

How  important  is  the  register  forwarding  in  SOAR's  datapath?  To  get  a  crude  idea,  we 
measured  how  often  our  simulated  instructions  used  a  forwarded  value  and  assessed  a 
penalty of one cycle. Table A.54 shows the results of this measurement. Forwarding is

Table A.54: Time cost for eliminating forwarding.

                                     ST        system    both
testClassOrganizer
  cycles                             42.56%    57.44%    100%
  extra time for pipeline bubbles    9.72%     14.02%    12.19%
testCompiler
  cycles                             34.07%    65.93%    100%
  extra time for pipeline bubbles    10.26%    14.67%    13.17%
testDecompiler
  cycles                             32.38%    67.62%    100%
  extra time for pipeline bubbles    10.66%    16.88%    14.86%
testPrintDefinition
  cycles                             38.09%    61.91%    100%
  extra time for pipeline bubbles    9.81%     21.31%    16.93%
testPrintHierarchy
  cycles                             25.90%    74.10%    100%
  extra time for pipeline bubbles    10.39%    21.22%    18.41%
average of macro-benchmarks
  cycles                             34.60%    65.40%    100.00%
  extra time for pipeline bubbles    10.17%    17.62%    15.11%
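
The entries of Table A.54 follow from the raw "forwarding cost" and "Cycles" rows in Appendix B: each use of a forwarded value is charged one extra cycle, and the total is divided by the cycles executed. The sketch below is an illustration using the testCompiler (Table B.6) and testPrintHierarchy (Table B.9) counts; the names `RAW` and `forwarding_penalty` are hypothetical.

```python
# name: {column: (forwarded-value uses, cycles)}, from Tables B.6 and B.9.
RAW = {
    "testCompiler": {
        "ST": (38049, 370941),
        "system": (105324, 717817),
        "both": (143373, 1088758),
    },
    "testPrintHierarchy": {
        "ST": (3166, 30458),
        "system": (18485, 87127),
        "both": (21651, 117585),
    },
}


def forwarding_penalty(forwarded_uses, cycles, penalty_cycles=1):
    """Extra time for pipeline bubbles, as a percentage of the cycles executed."""
    return 100.0 * penalty_cycles * forwarded_uses / cycles


for name, columns in RAW.items():
    row = ", ".join(f"{col} {forwarding_penalty(*vals):.2f}%"
                    for col, vals in columns.items())
    print(f"{name}: {row}")
```

Both benchmarks reproduce the "extra time for pipeline bubbles" rows of Table A.54.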




Table A.55: Instruction vs. Data Fetches, Part 1.

                                     ST        system    both
testActivationReturn
  all instruction references         65.14%    34.86%    100%
  all data references                32.08%    67.92%    100%
  all data + instruction references  61.15%    38.85%    100%
  I-fetches per cycle                90.73%    71.56%    82.98%
  I-flushes per cycle                3.15%     9.33%     5.65%
  D-fetches per cycle                6.12%     19.11%    11.37%

* Our simulator computed a value of -0.24% for this entry, clear evidence that our instruction c

Table A.55: Instruction vs. Data Fetches, Part 2.

                                     ST        system    both
testDecompiler
  all instruction references         32.19%    67.81%    100%
  all data references                33.27%    66.73%    100%
  all data + instruction references  32.42%    67.58%    100%
  I-fetches per cycle                68.17%    68.76%    68.57%
  I-flushes per cycle                12.57%    12.75%    12.69%
  D-fetches per cycle                19.26%    18.50%    18.74%
testPrintDefinition
  all instruction references         38.01%    61.99%
  all data references                36.82%    63.18%
  all data + instruction references  37.78%    62.22%
  I-fetches per cycle                73.08%    73.33%
  I-flushes per cycle                10.32%    9.14%
  D-fetches per cycle                16.61%    17.53%
testPrintHierarchy
  all instruction references         26.25%    73.75%    100%
  all data references                23.28%    76.72%    100%
  all data + instruction references  25.62%    74.38%    100%
  I-fetches per cycle                71.39%    70.11%    70.44%
  I-flushes per cycle                11.66%    10.36%    10.70%
  D-fetches per cycle                16.95%    19.53%    18.86%
average of macro-benchmarks
  all instruction references         34.19%    65.81%    100.00%
  all data references                33.57%    66.43%    100.00%
  all data + instruction references  34.06%    65.94%    100.00%
  I-fetches per cycle                71.88%    72.40%    72.18%
  I-flushes per cycle                9.86%     8.64%     9.07%
  D-fetches per cycle                18.26%    18.96%    18.75%


Appendix  B 


Raw SOAR Data


B.1.  Introduction 

This appendix contains the raw data we gathered and used for the calculations in
Appendix A. The first section contains instruction mixes for the second iteration of several
benchmarks. These were run in an image that was modified to eliminate almost all
occurrences of the become primitive, as outlined in Chapter 5. The second section contains
execution time profiles for the same runs. To guide the reader through this section, we have
reprinted part of the table of contents in Table B.1.

B.2. Instruction Mix Data


condJumps:  the  number  of  times  a  jump  immediately  followed  a  skip. 
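
As an illustration of this definition, the sketch below counts conditional jumps in an instruction trace. The trace format (a list of opcode names, with a leading `%` as used in the row labels of the tables that follow) and the function name `count_cond_jumps` are assumptions made for the example; the simulator's actual data structures are not described here.

```python
def count_cond_jumps(trace):
    """Count the jumps that immediately follow a skip in an opcode trace."""
    def base(opcode):
        # Treat "%skip" and "skip" (likewise "%jump" and "jump") alike.
        return opcode.lstrip("%")

    return sum(
        1
        for previous, current in zip(trace, trace[1:])
        if base(previous) == "skip" and base(current) == "jump"
    )


example_trace = ["%add", "%skip", "%jump", "load", "skip", "jump",
                 "%jump", "%skip", "%add"]
print(count_cond_jumps(example_trace))
```

In the example trace two jumps directly follow a skip; the third jump follows another jump and is not counted.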


Table B.2: test3plus4 Micro-Benchmark Instruction Mix.

                            ST        system    both
Steps                                           4642
Cycles                      3332      2261      5593


Table B.3: testPopStoreInstanceVariable Micro-Benchmark Instruction Mix.

                            ST        system    both

%extract
%insert


Table B.4: testActivationReturn Micro-Benchmark Instruction Mix.

                            ST        system    both
Steps                                           356067
Cycles                      463922    19772     483694
WO                          515       0         515
WU                          513       2         515
Ccodes                      3         0         3
%nop                        1         515       516
%ret                        0         1         1
%retn                       1         515       516
%retnw                      2         3         5
retnw                       33280     5         33285
%reti                       0         515       515
%retiw                      0         515       515
%skip                       0         1049      1049
skip                        32767     0         32767
trap1                       0         1         1
%trap2                      16383     0         16383
%trap3                      32769     1         32770
%store                      0         20        20
%storem                     0         515       515
%load                       32771     534       33305
loadc                       32769     1         32770
%loadm                      0         515       515
%and                        0         4         4
%or                         0         3         3
%add                        81926     2607      84533
%sub                        0         1552      1552
sub                         32766     0         32766
%extract                    0         1         1
%insert                     0         3         3
%jump                       1033      523       1556
jump                        16385     519       16904
%call                       0         4         4
call                        33285     4         33289

Table B.5: testClassOrganizer Macro-Benchmark Instruction Mix.

                            ST        system    both
WU retnw                    4692      1856      6548
TI trap1                    0         9         9
TI trap3                    641       0         641
GS retnw                    11        0         11
GS store                    14        0         14
loadm 8                     0         7037      7037
storem 5                    0         11        11
storem 7                    5320      435       5755
storem 8                    0         7037      7037
ret*w's                     75521     55960     131481
nonNil8-14                  192227    224667    416894
int8-14                     68182     131169    199351

                            ST        system    both
always %skip                3404      3
lt %skip                    0         8671
lt skip                     276       5123
ge %skip                    0         8526
ge skip                     14        144
ge trap1                    0         262
eq %skip                    7179      36799
eq skip                     6013      442
eq %trap1                   0         190
eq trap1                    0         1461
eq %trap4                   1318      12
ne %skip                    571       58466
ne %trap1                   0         7612
ne trap1                    0         3982
ne %trap3                   57631     653
le %skip                    0         17063
le skip                     7952      242
gt %skip                    0         101
gt skip                     459       5786
gt %trap1                   0         136
gt trap1                    0         131
ltu/in0 %skip               0         747
geu/out0 %skip              0         3928
geu/out0 %trap1             0         4528
geu/out0 trap1              0         25536
geu/out0 %trap2             13507     412
leu %skip                   0         1168
gtu %skip                   0         791
untaggedImm %ret            0         4623      4623
untaggedImm %retw           0         39        39
untaggedImm retw            0         522       522
untaggedImm %retn           2051      8599      10650
untaggedImm retn            0         47        47
untaggedImm %retnw          6180      13938     20118
untaggedImm retnw           16217     83657     99874
untaggedImm %reti           0         22728     22728


untaggedImm %retiw
untaggedImm %retinw
untaggedImm %skip
untaggedImm skip
untaggedImm %trap1
untaggedImm %load
untaggedImm load
untaggedImm loadc
untaggedImm %loadm
untaggedImm %xor
untaggedImm %and
untaggedImm %or
untaggedImm %add
untaggedImm add
untaggedImm %sub
untaggedImm sub
untaggedImm %extract
untaggedImm %insert
taggedImm %skip
taggedImm %trap1
taggedImm %trap2
taggedImm %trap4
taggedImm %load
taggedImm %and
taggedImm %or
taggedImm %add


srl barrel shifter savings
forwarding cost
two-tone savings


47677 


111047 

209716 


Table B.6: testCompiler Macro-Benchmark Instruction Mix.

                            ST        system    both
Steps                                           743753
Cycles                      370941    717817    1088758


1557 


2  75 

947 

1  168 

179 

3  1 

1 

1  13688 

13689 

2378 

960 

3  320 

320 

2  4410 

5622 

3  81 

81 

                            ST        system    both
%retnw                      2422      7362      9784
retnw                       8528      23221     31749
%reti                       0         7170      7170
%retiw                      0         2729      2729
%retinw                     0         75        75
%skip                       3737      77074     80811
skip                        4810      4342      9152
%trap1                      0         2763      2763
trap1                       0         7701      7701
%trap2                      4735      259       4994
%trap3                      18122     878       19000
%trap4                      450       19        469
%store                      1880      16253     18133
store                       2973      3476      6449
%storem                     1876      3236      5112
%load                       36937     65008     101945
load                        0         5087      5087
loadc                       18121     1235      19356
%loadm                      0         3316      3316
%srl                        0         9388      9388
%sra                        0         50        50
sra                         0         24        24
%xor                        0         1304      1304
%and                        11        13067     13078
and                         30        4         34
%or                         451       4818      5269
%add                        66485     106045    172530
add                         3094      4385      7479
%sll                        0         5159      5159
sll                         0         1406      1406
%sub                        450       20291     20741
sub                         1124      5830      6954
%extract                    0         12979     12979
%insert                     0         3697      3697
%jump                       26002     24280     50282
jump                        9420      20049     29469
%call                       0         6801      6801
call                        34157     2548      36705
TT skip                     579       1         580
TT loadc                    2793      17        2810
WOO?                        2088      641       2729
WU retw                     0         146       146
WU retnw                    1889      694       2583
TI trap1                    0         75        75
TI trap3                    872       0         872
GS retnw                    4         0         4
GS store                    7         168       175
loadm 7
untaggedImm %retinw         0         75        75
untaggedImm %skip           0         9993      9993
untaggedImm skip            1658      485       2143
untaggedImm %trap1          0         74        74
untaggedImm %load           33942     47159     81101
untaggedImm load            0         5087      5087
untaggedImm loadc           18121     1235      19356
untaggedImm %loadm          0         3316      3316
untaggedImm %xor            0         447       447
untaggedImm %and            0         6924      6924
untaggedImm and             17        4         21
untaggedImm %or             1         147       148
untaggedImm %add            8059      70090     78149
untaggedImm add             2189      120       2309
untaggedImm %sub            450       17456     17906
untaggedImm sub             542       4359      4901
untaggedImm %extract        0         10833     10833
untaggedImm %insert         0         2423      2423
taggedImm %skip             1058      17170     18228
taggedImm %trap1                      952       952
taggedImm %trap2            4735      259       4994
taggedImm %trap4            450       19        469
taggedImm %load             2995      11425
taggedImm %and              11        2317
taggedImm %or               450       1995      2445
taggedImm %add              3662      8732      12394
sll barrel shifter savings  0         3         3
srl barrel shifter savings  0         1900      1900
sra barrel shifter savings  0         24        24
forwarding cost             38049     105324    143373
two-tone savings            68706     91028     159734
condJumps                   3416      28601     32017

Table B.7: testDecompiler Macro-Benchmark Instruction Mix.

                            ST        system    both
Steps                                           1983995
Cycles                      936933    1956663   2893596
Ccodes                      6016                6016
TT                          8641      6         8647
WO                          3225      1548      4773
WU                          3433      1340      4773
TI                          3217      6         3223
%retnw                      3975      15526     19501
retnw                       21194     63099     84293
%reti                       0         16637     16637
%retiw                      0         4773      4773
%retinw                     0         6         6
%skip                       4601      236206    240807
skip                        15999     8356      24355
%trap1                      0         8682      8682
trap1                       0         21010     21010
%trap2                      12417     788       13205
%trap3                      45968     3212      49180
%trap4                      1088      82        1170
%store                      7926      50609     58535
store                       5375      4826      10201
%storem                     4680      6919      11599
%load                       88555     196228    284783
load                        0         14998     14998
loadc                       45962     3836      49798
%loadm                      0         4773      4773
%srl                        0         17159     17159
sra                         0         2120      2120
%xor                        0         6329      6329
%and                        31        36239     36270
and                         500       0         500
%or                         1088      10335     11423
%add                        186538    309908    496446
add                         11775     13398     25173
%sll                        0         7956      7956
sll                         0         1306      1306
%sub                        1088      46940     48028
sub                         2890      15654     18544
%extract                    0         37296     37296
%insert                     0         15013     15013
%jump                       60263     64066     124329
jump                        22167     52476     74643
%call                       0         17876     17876
call                        84494     7471      91965
TT skip                     798       0         798
TT loadc                    7843      6         7849
WOO?                        3225      1548      4773
WU retw                     0         1         1
WU retnw                    3433      1339      4772
TI trap1                    0         6         6
TI trap3                    3217      0         3217
loadm 8                     0         4773      4773
storem 4                    29        0         29
storem 6                    29        0         29
storem 7                    4622      2146      6768
storem 8                    0         4773      4773
ret*w's                     55944     48679     104623
nonNil8-14                  155632    215034    370666
int8-14                     56499     125311    181810
always %skip                1726      61        1787
lt %skip                    0         44547     44547
lt skip                     797       1852      2649
ge %skip                    0         4500      4500
ge skip                     1095      135       1230
ge trap1                    0         168       168
eq %skip                    2683      40869     43552
eq skip                     5332      734       6066
eq %trap1                   0         1022      1022
eq trap1                    0         921       921
eq %trap4                   1088      82        1170
ne %skip                    192       123615    123807
ne skip                     0         88        88
ne %trap1                   0         5499      5499
ne trap1                    0         2875      2875
ne %trap3                   45968     3212      49180
le %skip                    0         10193     10193
le skip                     6112      142       6254
gt skip                     1865      5405      7270
gt %trap1                   0         115       115
gt trap1                    0         84        84
ltu/in0 %skip               0         2961      2961
geu/out0 %skip              0         1015      1015
geu/out0 %trap1             0         2046      2046
geu/out0 trap1              0         16384     16384
geu/out0 %trap2             12417     788       13205
leu %skip                   0         5107      5107
gtu %skip                   0         3338      3338
out1 trap1                  0         578       578
untaggedImm %ret            0         6092      6092
untaggedImm retw            0         24        24
untaggedImm %retn           4049      8783      12832
untaggedImm retn            0         534       534
untaggedImm %retnw          3521      15496     19017
untaggedImm retnw           18215     61790     80005
untaggedImm %reti           0         16637     16637
untaggedImm %retiw          0         4773      4773
untaggedImm %retinw         0         6         6
untaggedImm %skip           0         28251     28251
untaggedImm skip            6564      0         6564
untaggedImm %trap1          0         379       379
untaggedImm %load           82930     153764    236694
untaggedImm load            0         14998     14998
untaggedImm loadc           45962     3836      49798

untaggedImm %loadm
untaggedImm %xor
untaggedImm %and
untaggedImm and
untaggedImm %or
untaggedImm %add
untaggedImm add
untaggedImm %sub
untaggedImm sub
untaggedImm %extract
untaggedImm %insert
taggedImm %skip
taggedImm %trap1
taggedImm %trap2
taggedImm %trap4
taggedImm %load
taggedImm %and
taggedImm %or
taggedImm %add


srl barrel shifter savings
forwarding cost
two-tone savings
condJumps


Table B.8: testPrintDefinition Macro-Benchmark Instruction Mix.


system  both 


H 


28249 


I  VVV.  S  V 


                            ST        system    both
%store                      214
store                       38        0         38
%storem                     0         47        47
%load                       2857      5571      8428
load                        0         866       866
loadc                       1648      38        1686
%loadm                      0         20        20
%srl                        0         868       868
%xor                        0         621       621
%and                        0         1238      1238
and                         1         0         1
%or                         0         292       292
%add                        6277      6745      13022
add                         469       460       929
%sll                        0         199       199
%sub                        0         726       726
sub                         14        908       922
%extract                    0         2031      2031
%insert                     0         750       750
%jump
jump
%call
call
TT skip
TT loadc
WOO?
WU retnw
TI trap3
GS retnw
GS store
loadm 8
storem 5
storem 7
storem 8
ret*w's
nonNil8-14
int8-14
always %skip


eq  Sfctrapl 


1 

0 

0 

58 

0 

4 

0 

4 

3 

1379 

7 

0 

                            ST        system    both
ne trap1                    0         149       149
ne %trap3                   1648      14        1662
le %skip                    0         227       227
le skip                     417       0         417
gt %skip                    0         1         1
gt skip                     258       0         258
gt %trap1                   0         2         2
gt trap1                    0         2         2
ltu/in0 %skip               0         39        39
geu/out0 %skip              0         199       199
geu/out0 %trap1             0         202       202
geu/out0 trap1              0         1083      1083
geu/out0 %trap2             282       2         284
leu %skip                   0         64        64
gtu %skip                   0         42        42
untaggedImm %ret            0         161       161
untaggedImm %retw           0         3         3
untaggedImm %retn           38        43        81
untaggedImm %retnw          160       324       484
untaggedImm retnw           360       2482      2842
untaggedImm %reti           0         250       250
untaggedImm %retiw          0         20        20
untaggedImm %skip           0         992       992
untaggedImm skip            26        0         26
untaggedImm %trap1          0         4         4
untaggedImm %load           2725      3667      6392
untaggedImm load            0         866       866
untaggedImm loadc           1648      38        1686
untaggedImm %loadm          0         20        20
untaggedImm %xor            0         217       217
untaggedImm %and            0         910       910
untaggedImm %add            329       4601      4930
untaggedImm add             469       0         469
untaggedImm %sub            0         693       693
untaggedImm sub             8         679       687
untaggedImm %extract        0         1423      1423
untaggedImm %insert         0         114       114
taggedImm %skip             7         531       538
taggedImm %trap1            0         202       202
taggedImm %trap2            282       2         284
taggedImm %load             132       972       1104
taggedImm %and              0         55        55
taggedImm %or               0         53        53
taggedImm %add              412       193       605
srl barrel shifter savings  0         434       434
forwarding cost             2770      9784      12554
two-tone savings            5800      9344      15144
condJumps                   299       1281      1580

Table B.9: testPrintHierarchy Macro-Benchmark Instruction Mix.

                            ST        system    both
Steps                                           82833
Cycles                      30458     87127     117585
Ccodes                      193       0         193
TT                          86        0         86
WO                          117       26        143
WU                          81        62        143
TI                          85        3         88
GS                          24        0         24
%nop                        1         169       170
%ret                        0         249       249
%retw                       0         48        48
%retn                       109       256       365
retn                        0         8         8
%retnw                      208       396       604
retnw                       618       2329      2947
%reti                       0         338       338
%retiw                      0         143       143
%retinw                     0         3         3
%skip                       261       8996      9257
skip                        545       35        580
%trap1                      0         1148      1148
trap1                       0         1324      1324
%trap2                      303       6         309
%trap3                      1657      98        1755
%trap4                      13        0         13
%store                      176       2595      2771
store                       57        30        87
%storem                     52        265       317
%load                       2908      10094     13002
load                        0         890       890
loadc                       1657      117       1774
%loadm                      0         167       167
%srl                        0         1308      1308
%xor                        0         1782      1782
%and                        0         1955      1955
and                         4         0         4
%or                         13        643       656
%add                        6770      12942     19712
add                         452       155       607
%sll                        0         22        22
sll                         0         3         3
%sub                        13        2218      2231
sub                         50        513       563
%extract                    0         2567      2567
%insert                     0         1734      1734
%jump                       1870      2509      4379
jump                        828       2284      3112
%call                       0         547       547
call                        2986      202       3188
TT skip                     12        0         12
TT loadc                    74        0         74
WOO?                        117       26        143
WU retnw                    81        62        143
TI trap1                    0         3         3
TI trap3                    85        0         85
GS retnw                    12        0         12
GS store                    12        0         12
loadm 7                     0         24        24
loadm 8                     0         143       143
storem 5                    0         12        12
storem 7                    52        110       162
storem 8                    0         143       143
ret*w's                     1888      1702
nonNil8-14                  5682      8475      14157
int8-14                     1344      4622      5966
always %skip                45        0         45
lt %skip                    0         1362      1362
lt skip                     7         11        18
ge %skip                    0         9         9
ge skip                     5         4         9
ge trap1                    0         8         8
eq %skip                    216       1552      1768
eq skip                     51        4         55
eq %trap1                   0         24        24
eq trap1                    0         4         4
eq %trap4                   13        0         13
ne %skip                    0         5543      5543
ne skip                     0         2         2
ne %trap1                   0         750       750
ne trap1                    0         152       152
ne %trap3                   1657      98        1755
le %skip                    0         108       108
le skip                     377       0         377
gt %skip                    0         12        12
gt skip                     93        14        107
gt %trap1                   0         4         4
gt trap1                    0         4         4
ltu/in0 %skip               0         108       108
geu/out0 %skip              0         12        12
geu/out0 %trap1             0         370       370
geu/out0 trap1              0         1154      1154
geu/out0 %trap2             303       6         309
leu %skip                   0         182       182
gtu %skip                   0         108       108
out1 trap1                  0         2         2
untaggedImm %ret            0         237       237
untaggedImm %retw           0         36        36
untaggedImm %retn           109       256       365
untaggedImm retn            0         8         8
untaggedImm %retnw          200       390       590
untaggedImm retnw           545       2273      2818
untaggedImm %reti           0         338       338
untaggedImm %retiw          0         143       143
untaggedImm %retinw         0         3         3
untaggedImm %skip           0         1983      1983
untaggedImm skip            81        0         81
untaggedImm %trap1          0         8         8
untaggedImm %load           2504      7300      9804
untaggedImm load            0         890       890
untaggedImm loadc           1657      117       1774
untaggedImm %loadm          0         167       167
untaggedImm %xor            0         370       370
untaggedImm %and            0         887       887
untaggedImm %add            728       9243      9971
untaggedImm add             441       1         442
untaggedImm %sub            13        2067      2080
untaggedImm sub             23        417       440
untaggedImm %extract        0         891       891
untaggedImm %insert         0         288       288
taggedImm %skip             35        1752      1787
taggedImm %trap1            0         370       370
taggedImm %trap2            303       6         309
taggedImm %trap4            13        0         13
taggedImm %load             404       1882      2286
taggedImm %and              0         241       241
taggedImm %or               13        195       208
taggedImm %add              37        669       706
srl barrel shifter savings  0         647       647
forwarding cost             3166      18485     21651
two-tone savings            6621      8649      15270
condJumps                   363       2385      2748

B.3. Execution Profile Data

The data in this section were derived by modifying the simulator to sample its PC
every 100 cycles, and using an awk [AKW] program to merge the samples with the assembler's
symbol table. Instrumenting the simulator instead of the SOAR program enables us to
profile the program without altering its behavior. All times listed in this appendix are given
as a percentage of the total time. For an explanation of the primitive numbers, see the
Smalltalk-80 book by Goldberg and Robson [GoR83]. The more obscure labels can only be
understood by reading our code.
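
The merge step described above can be sketched as follows. This is not the awk program used for the dissertation: it is a minimal Python illustration that assumes the symbol table is available as (start address, label) pairs and the samples as a list of PC values, one per 100 simulated cycles. The function name `profile` and the example addresses are invented for the illustration.

```python
import bisect
from collections import Counter


def profile(samples, symbols):
    """Attribute each sampled PC to the nearest preceding label.

    samples: sampled PC values.
    symbols: (start_address, label) pairs from the assembler's symbol table.
    Returns {label: percentage of all samples}.
    """
    table = sorted(symbols)
    starts = [address for address, _ in table]
    hits = Counter()
    for pc in samples:
        index = bisect.bisect_right(starts, pc) - 1
        # PCs below the first label are lumped together as "other".
        hits[table[index][1] if index >= 0 else "other"] += 1
    total = len(samples)
    return {label: 100.0 * count / total for label, count in hits.items()}


example_symbols = [(0x100, "lookup"), (0x200, "blockCopy")]
example_samples = [0x104, 0x1FC, 0x200, 0x50]
print(profile(example_samples, example_symbols))
```

Because the samples are taken at a fixed cycle interval, the share of samples landing under a label estimates the share of total time spent there, which is how the percentages in the following tables are expressed.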


Table B.12: testClassOrganizer Macro-Benchmark Execution Time Profile.

Smalltalk
WindowOverflowTrapH
SIQuoPrm
StringAtPrm
Prim_60
WSNextPutPrm
WindowUnderflowTrapH
SIMulPrm
StringReplaceFromToWithStartingPrm
lookupMethodInClass
Prim_62
SkipTagTrapH
RSNextPrm
SISISIPrm
LoadcTagTrapH
BehavNew
BCValuePrm2
SYS_word_fill
SkipOnTrue
SILTPrm
SkipTagTrapS
Prim_61
Prim_110
SkipTagTrapH!done
blockCopy
lookup
Prim_81
PSAtEndPrm
blockArrowReturn
FailPrm


Table B.13: testCompiler Macro-Benchmark Execution Time Profile.


03%  Prim_81 

03%  Prim  71 

03%  Prim"70 

03%  Prim j  10 

0.2%  methodBlockCopy 

0.2%  insert!04!sel!here 

03%  getWordSim 

;  03%  eqNewNewBecomc 

:  03%  argumentCount 
03%  SkipTagTrapH.’done 
03%  SlripOnFalse 

03%  Prim  111 

!  03%  PSAtEndPrra 
j  0.1%  insert!03!sel!herc 

:  0.1%  gsSurvivors 
|  0.1%  gsStoreGSTrapS 

j  0.1%  gsRemcmbcred 
0.1%  SVTiace 

i  0.1%  Prim_83 
i  0.1%  Prim~75 
0.1%  FailPrm 


Table  B.14:  testDecompiler  Macro-Benchmark  Execution  Time  Profile. 


Smalltalk 

213% 

lookupMethodln  Class 

7% 

BehavNew 

3.8% 

Prim_60 

3.7% 

WindowOverfiowTrapH 

3.7% 

WSNextPutPrm 

33% 

SYS_word_fill 

3.1% 

Prim_61 

2.7% 

WindowUnderflowTrapH 

2.1% 

SIQuoPrm 

2% 

lookup 

13% 

StringReplaceFromT o W  i  thS  tartingPnn 

13% 

StringAtPrm 

1.3% 

Prim_62 

1.1% 

allocSpace 

1.1% 

SlSlSlPrm 

1.1% 

BCValuePrm2 

1% 

biockCopy 

0.9% 

SIMulPnn 

0.8% 

LoadcTagTrapH 

03% 

cacheMissLookup 

03% 

SkipOnTme 

03% 

Prim_71 

03% 

Prim_70 

0.4% 

Try  Right 

0.3% 

other 

03% 

SkipTagTrapH 

